A Novel Approach for Breast Cancer Detection using Data Mining Techniques
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
Presented by: Ahmed Abd Elhafeez
AGENDA – Scientific and Medical Background
1. What is cancer
2. Breast cancer
3. History and background
4. Pattern recognition system decomposition
5. About data mining
6. Data mining tools
7. Classification techniques
AGENDA (Cont.) – Paper contents
1. Introduction
2. Related Work
3. Classification Techniques
4. Experiments and Results
5. Conclusion
6. References
What Is Cancer?
Cancer is a term used for diseases in which abnormal cells divide without control and are able to invade other tissues. Cancer cells can spread to other parts of the body through the blood and lymph systems.
Cancer is not just one disease but many diseases: there are more than 100 different types of cancer.
Most cancers are named for the organ or type of cell in which they start.
There are two general types of cancer tumours, namely:
• benign
• malignant
Examples: skin cancer, breast cancer, colon cancer, lung cancer, pancreatic cancer, liver cancer, bladder cancer, prostate cancer, kidney cancer, thyroid cancer, leukemia, endometrial cancer, rectal cancer, non-Hodgkin lymphoma, cervical cancer, oral cancer.
Breast Cancer
• The second leading cause of death among women is breast cancer, as it comes directly after lung cancer.
• Breast cancer is considered the most common invasive cancer in women, with more than one million cases and nearly 600,000 deaths occurring worldwide annually.
• Breast cancer comes at the top of the cancer list in Egypt, with 42 cases per 100 thousand of the population. However, 80% of the cases of breast cancer in Egypt are of the benign kind.
History and Background
Medical prognosis is the estimation of:
• cure
• complication
• disease recurrence
• survival
for a patient or group of patients after treatment.
Breast Cancer Classification
• Round, well-defined, larger groups are more likely benign.
• A tight cluster of tiny, irregularly shaped groups may indicate cancer (malignant).
• Suspicious pixel groups show up as white spots on a mammogram.
Breast Cancer's Features
• MRI – cancer can have a unique appearance; features that turned out to be cancer are used for diagnosis and prognosis of each cell nucleus.
[Diagram: Magnetic Resonance Image → Feature Extraction → features F1, F2, F3, …, Fn]
Diagnosis or Prognosis
[Diagram: breast cancer → benign or malignant]
Computer-Aided Diagnosis
• Mammography allows for efficient diagnosis of breast cancers at an earlier stage.
• Radiologists misdiagnose 10–30% of the malignant cases.
• Of the cases sent for surgical biopsy, only 10–20% are actually malignant.
Computational Intelligence
Computational Intelligence = Data + Knowledge. It draws on artificial intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms and neural networks.
What do these methods do?
• Provide non-parametric models of data.
• Allow to classify new data to pre-defined categories, supporting diagnosis and prognosis.
• Allow to discover new categories.
• Allow to understand the data, creating fuzzy or crisp logical rules.
• Help to visualize multi-dimensional relationships among data samples.
Pattern recognition system decomposition
[Pipeline: dataset → data preprocessing → feature selection → selecting the data mining tool → classification algorithm (SMO, IBK, BF Tree) → results and evaluations]
[Performance evaluation cycle: dataset → data preprocessing → feature selection → data mining tool selection → classification → results]

Data sets
Data Mining
• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit data to a model
  – descriptive
  – predictive
Predictive & descriptive data mining
• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.
Data Mining Models and Tasks
[Figure: taxonomy of data mining models and tasks]
Data mining tools
Many advanced tools for data mining are available, either as open-source or commercial software.
Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.
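To illustrate the point above about calling Weka from your own Java code, here is a minimal sketch (not from the paper) that loads an ARFF file and builds one classifier. The file name "breast-cancer-wisconsin.arff" is a placeholder, and the class attribute is assumed to be the last column.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.functions.SMO;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (the path is a placeholder)
        DataSource source = new DataSource("breast-cancer-wisconsin.arff");
        Instances data = source.getDataSet();
        // Assume the class attribute is the last column
        data.setClassIndex(data.numAttributes() - 1);

        // Build an SMO (support vector machine) classifier on the data
        SMO smo = new SMO();
        smo.buildClassifier(data);

        // Classify the first instance and print the predicted class label
        double pred = smo.classifyInstance(data.instance(0));
        System.out.println("Predicted: " + data.classAttribute().value((int) pred));
    }
}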
Data Preprocessing
• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; measures include accuracy, completeness, consistency, timeliness, believability, value added and accessibility.
Preprocessing techniques
• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a reduced representation in volume, but produces the same or similar analytical results.
• Data discretization – part of data reduction, with particular importance especially for numerical data.
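As a hedged example of how two of these steps (cleaning of missing values and normalization as a transformation) could be scripted with Weka's standard filters; this is an illustrative sketch, not the paper's actual preprocessing code.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.attribute.Normalize;

public class PreprocessSketch {
    // Fill missing values with attribute means/modes, then scale numeric attributes to [0, 1]
    public static Instances clean(Instances raw) throws Exception {
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(raw);
        Instances noMissing = Filter.useFilter(raw, fillMissing);

        Normalize normalize = new Normalize();
        normalize.setInputFormat(noMissing);
        return Filter.useFilter(noMissing, normalize);
    }
}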
Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• improving the prediction performance of the predictors
• providing faster and more cost-effective predictors
• providing a better understanding of the underlying process that generated the data
Feature Selection
• Transforming a dataset by removing some of its columns:
  (A1, A2, A3, A4, C) → (A2, A4, C)
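As a sketch of how such a column-removing transformation can be obtained automatically, the snippet below ranks attributes by information gain using Weka's attribute selection classes. Keeping the top 5 attributes is an arbitrary, illustrative choice, not the paper's setting.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class FeatureSelectionSketch {
    // Rank attributes by information gain and return the selected column indices
    public static int[] rank(Instances data) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);          // keep the 5 highest-ranked attributes (arbitrary choice)
        selector.setSearch(ranker);
        selector.SelectAttributes(data);   // Weka's method name starts with a capital S
        return selector.selectedAttributes();
    }
}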
Supervised Learning (Classification / Recognition)
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set (known categories, e.g. category "A" and category "B").
Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.
Classification vs. Prediction
• Classification:
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Prediction:
  – models continuous-valued functions, i.e. predicts unknown or missing values
Classification: A Two-Step Process
1. Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model:
    • The known label of each test sample is compared with the classified result from the model.
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model.
    • The test set is independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
Classification Process (1): Model Construction

Training data:
NAME     RANK            YEARS  TENURED
Mike     Assistant Prof  3      no
Mary     Assistant Prof  7      yes
Bill     Professor       2      yes
Jim      Associate Prof  7      yes
Dave     Assistant Prof  6      no
Anne     Associate Prof  3      no

The classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
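The learned rule from the previous slide can be written as a tiny hand-coded classifier and applied to the unseen example; this is only an illustration of "model usage", not part of the paper, and the names are made up.

public class TenureRule {
    // IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    static boolean tenured(String rank, int years) {
        return rank.equalsIgnoreCase("professor") || years > 6;
    }

    public static void main(String[] args) {
        // Unseen instance (Jeff, Professor, 4) is predicted as tenured = yes
        System.out.println(tenured("Professor", 4));   // prints true
    }
}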
Classification
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
Classification
• Predicts categorical class labels (discrete or nominal).
• Constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data.
Quality of a classifier
• Quality will be calculated with respect to the lowest computing time.
• The quality of a certain model can be described by a confusion matrix.
• The confusion matrix shows, for new entries, the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.
Classification Techniques
Common classification techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree and IBK.
Classification Model: Support Vector Machine Classifier (V. Vapnik)
Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
Tennis example
[Scatter plot of humidity vs. temperature; points marked as "play tennis" or "do not play tennis"]
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data
(ii) place the hyper-plane "far" from the data
SVM – Support Vector Machines
[Figure: the support vectors define the separating hyperplane; a small margin vs. a large margin]
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: support vectors on the maximized margin vs. a narrower margin]
Non-Separable Case

The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
Classification Model: K-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the (K) examples that are most similar to it.
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
[Figure: a new point surrounded by "response" and "no response" neighbors; predicted class: response]
Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:
  D(X, Y) = sqrt[ (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 ]
• Example: John (age = 35, income = 95K, no. of credit cards = 3) and Rachel (age = 41, income = 215K, no. of credit cards = 2):
  Distance(John, Rachel) = sqrt[ (35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2 ]
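A small, self-contained Java sketch of the distance computation and the 3-nearest-neighbor vote, using the customer numbers from the slides that follow; the class names and structure are illustrative only.

import java.util.Arrays;
import java.util.Comparator;

public class KnnSketch {
    // Euclidean distance between two numeric feature vectors
    static double distance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Majority vote among the k training examples closest to the query
    static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], query)));
        int yes = 0;
        for (int i = 0; i < k; i++) if (labels[idx[i]].equals("Yes")) yes++;
        return yes > k / 2 ? "Yes" : "No";
    }

    public static void main(String[] args) {
        // Attributes: age, income (K), number of credit cards, as in the slide example
        double[][] train = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        String[] labels = {"No", "Yes", "No", "No", "Yes"};
        double[] david = {37, 50, 2};
        System.out.println(classify(train, labels, david, 3));   // prints "Yes"
    }
}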
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Figure: a new point classified as "respond" from its stored neighbors]
Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?
Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David     37   50          2          Yes (predicted from the 3 nearest neighbors: Rachel, John, Nellie)
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (needs to calculate and compare the distance from the new example to all other examples).
Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1
Jenny Lind - Payoff Table

Decision                  Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior probabilities       0.3               0.6                0.1
Using Expected Return Criteria
EV_movie = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (= EV_Best)
EV_tv = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
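The same expected-value arithmetic, written as a short Java sketch so the numbers can be checked; the class and variable names are illustrative only.

public class ExpectedValueSketch {
    // Expected return = sum over outcomes of probability * payoff
    static double expectedValue(double[] probs, double[] payoffs) {
        double ev = 0;
        for (int i = 0; i < probs.length; i++) ev += probs[i] * payoffs[i];
        return ev;
    }

    public static void main(String[] args) {
        double[] probs = {0.3, 0.6, 0.1};   // small, medium, large box office
        double evMovie = expectedValue(probs, new double[]{200_000, 1_000_000, 3_000_000});
        double evTv    = expectedValue(probs, new double[]{900_000, 900_000, 900_000});
        System.out.println("EV movie = " + evMovie + ", EV TV = " + evTv);  // 960000.0 vs 900000.0
    }
}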
Decision Trees
• Three types of "nodes":
  – decision nodes, represented by squares
  – chance nodes, represented by circles (Ο)
  – terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.
Example Decision Tree
[Figure: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2 and Event 3]
Jenny Lind Decision Tree
[Figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes Small Box Office ($200,000), Medium Box Office ($1,000,000) and Large Box Office ($3,000,000). "Sign with TV Network" leads to a chance node with the same three outcomes, each paying $900,000.]
Jenny Lind Decision Tree - Solved
[Figure: the same tree with probabilities 0.3, 0.6 and 0.1 on the chance branches. ER for "Sign with Movie Co." = 0.3($200,000) + 0.6($1,000,000) + 0.1($3,000,000) = $960,000; ER for "Sign with TV Network" = $900,000. The best decision, with ER = $960,000, is to sign with the movie company.]
Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn
Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – split the data into 10 equal-sized pieces
  – train on 9 pieces and test on the remainder
  – do this for all possibilities and average
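A sketch of how such a 10-fold cross-validation run could be reproduced with the Weka Java API, here with the SMO classifier used later in the paper; the dataset path is a placeholder and the exact output text depends on the Weka version.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of an SMO classifier
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new SMO(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());   // correctly/incorrectly classified counts
        System.out.println(eval.toMatrixString());    // confusion matrix
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}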
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is developing accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• are usually not harmful
• rarely invade the tissues around them
• don't spread to other parts of the body
• can be removed and usually don't grow back
Malignant tumors:
• may be a threat to life
• can invade nearby organs and tissues (such as the chest wall)
• can spread to other parts of the body
• often can be removed, but sometimes grow back
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods
Risk factors (cont.)
• Breast radiation early in life
• Treatment with DES (the drug diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiocography1, cardiocography2) and other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence these percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
Attribute                    Domain
Sample Code Number           Id number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.
EXPERIMENTAL RESULTS
[Screenshots of the Weka experiment setup and output]
Importance of the input variables

Frequency of each attribute value (domain 1-10) over the 683 instances:

Attribute                     1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness             139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size     373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape    346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion           393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size  44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                 402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin             150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli             432   36   42   18   19   22   16   23   15   60   683
Mitoses                     563   35   33   12    6    3    9    8    0   14   683
Sum                        2843  850  605  333  346  192  207  233   77  516
EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (in sec)       0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
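These definitions can be checked with a few lines of Java. The counts below are taken from the SMO confusion matrix shown on the next slides, with benign treated as the positive class; the class and variable names are illustrative only.

public class MetricsSketch {
    // Derive sensitivity, specificity and accuracy from confusion-matrix counts
    public static void main(String[] args) {
        // Example counts for the SMO classifier (benign = positive class)
        double tp = 431, fn = 13, fp = 13, tn = 226;

        double sensitivity = tp / (tp + fn);                  // true positive rate
        double specificity = tn / (tn + fp);                  // true negative rate
        double accuracy    = (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("sensitivity=%.4f specificity=%.4f accuracy=%.4f%n",
                sensitivity, specificity, accuracy);          // 0.9707, 0.9456, 0.9619
    }
}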
EXPERIMENTAL RESULTS

Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
EXPERIMENTAL RESULTS

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
Importance of the input variables

Variable                     Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness              378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size      539.79308     0.702       0.3          180.265026     1
Uniformity of Cell Shape     523.07097     0.677       0.272        174.673323     2
Marginal Adhesion            390.0595      0.464       0.21         130.2445       7
Single Epithelial Cell Size  447.86118     0.534       0.233        149.542726     5
Bare Nuclei                  489.00953     0.603       0.303        163.305176     3
Bland Chromatin              453.20971     0.555       0.201        151.321903     4
Normal Nucleoli              416.63061     0.487       0.237        139.118203     6
Mitoses                      191.9682      0.212       0.212        64.122733      9
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.
Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.
Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the international conference on engineering applications of neural networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305–313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
04072023AAST-Comp eng3
AGENDA (Cont) Paper contents1 Introduction2 Related Work3 Classification Techniques4 Experiments and Results5 Conclusion6 References
04072023
What Is Cancer Cancer is a term used for diseases in which
abnormal cells divide without control and are able to invade other tissues Cancer cells can spread to other parts of the body through the blood and lymph systems
Cancer is not just one disease but many diseases There are more than 100 different types of cancer
Most cancers are named for the organ or type of cell in which they start
There are two general types of cancer tumours namelybull benignbull malignant
4 AAST-Comp eng
Skin cancer
Breast cancerColon cancer
Lung cancer
Pancreatic cancer
Liver cancer
Bladder cancer
Prostate Cancer
Kidney cancerThyroid Cancer
Leukemia Cancer
Edometrial Cancer
Rectal Cancer
Non-Hodgkin LymphomaCervical cancer
Thyroid Cancer
Oral cancer
AAST-Comp eng 504072023
Breast Cancer
6
bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer
bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually
bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind
AAST-Comp eng04072023
History and Background
Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment
7AAST-Comp eng04072023
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and one for patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function performs better than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5 %), Malignant 241 (34.5 %).
 Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; for the 683-instance dataset the distribution is Benign 444 (65 %) and Malignant 239 (35 %).
04072023AAST-Comp eng85
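The missing-value removal described above can be scripted against the Weka API. A small sketch, assuming the UCI file has been converted to the same hypothetical breast-cancer-wisconsin.arff (the 16 missing values all sit in the Bare Nuclei attribute):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrepareDataset {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded instances: " + data.numInstances()); // 699

        // Delete every instance that has a missing value in any attribute (16 rows)
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }
        System.out.println("After removal:    " + data.numInstances()); // 683
    }
}
```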
04072023 AAST-Comp eng 86
Attribute                        Domain
Sample Code Number               Id Number
Clump Thickness                  1 - 10
Uniformity of Cell Size          1 - 10
Uniformity of Cell Shape         1 - 10
Marginal Adhesion                1 - 10
Single Epithelial Cell Size      1 - 10
Bare Nuclei                      1 - 10
Bland Chromatin                  1 - 10
Normal Nucleoli                  1 - 10
Mitoses                          1 - 10
Class                            2 for Benign, 4 for Malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
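To mirror the experiment's "time to build model" criterion, the three classifiers can be trained programmatically and timed. A minimal sketch, assuming Weka 3.6.x (where the BFTree class ships with the main distribution) and the hypothetical ARFF file used above:

```java
import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;   // available in Weka 3.6.x
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildAndTime {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
        for (Classifier c : classifiers) {
            long start = System.currentTimeMillis();
            c.buildClassifier(data);                 // train on the full dataset
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(c.getClass().getSimpleName() + " built in " + elapsed + " ms");
        }
    }
}
```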
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain                          1     2     3     4     5     6     7     8     9    10   Sum
Clump Thickness               139    50   104    79   128    33    23    44    14    69   683
Uniformity of Cell Size       373    45    52    38    30    25    19    28     6    67   683
Uniformity of Cell Shape      346    58    53    43    32    29    30    27     7    58   683
Marginal Adhesion             393    58    58    33    23    21    13    25     4    55   683
Single Epithelial Cell Size    44   376    71    48    39    40    11    21     2    31   683
Bare Nuclei                   402    30    28    19    30     4     8    21     9   132   683
Bland Chromatin               150   160   161    39    34     9    71    28    11    20   683
Normal Nucleoli               432    36    42    18    19    22    16    23    15    60   683
Mitoses                       563    35    33    12     6     3     9     8     0    14   683
Sum                          2843   850   605   333   346   192   207   233    77   516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                  BF TREE    IBK     SMO
Time to Build Model (in sec)            0.97    0.02    0.33
Correctly Classified Instances           652     655     657
Incorrectly Classified Instances          31      28      26
Accuracy (%)                           95.46   95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree        0.971     0.075       0.960    0.971   Benign
               0.925     0.029       0.944    0.925   Malignant
IBK            0.980     0.079       0.958    0.980   Benign
               0.921     0.020       0.961    0.921   Malignant
SMO            0.971     0.054       0.971    0.971   Benign
               0.946     0.029       0.946    0.946   Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13             Benign
                    18                 221             Malignant
IBK                435                   9             Benign
                    19                 220             Malignant
SMO                431                  13             Benign
                    13                 226             Malignant
94 04072023AAST-Comp eng
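Working from the SMO confusion matrix above, with Benign taken as the positive class, the metrics defined earlier can be checked by hand; a small arithmetic sketch (the class name is illustrative):

```java
public class ConfusionMetrics {
    public static void main(String[] args) {
        // SMO confusion matrix, Benign as the positive class
        int tp = 431, fn = 13;   // actual benign: correctly / wrongly classified
        int fp = 13, tn = 226;   // actual malignant: wrongly / correctly classified

        double sensitivity = (double) tp / (tp + fn);              // TPR
        double specificity = (double) tn / (tn + fp);              // TNR
        double accuracy    = (double) (tp + tn) / (tp + fn + fp + tn);

        System.out.printf("sensitivity = %.4f%n", sensitivity);    // 0.9707
        System.out.printf("specificity = %.4f%n", specificity);    // 0.9456
        System.out.printf("accuracy    = %.4f%n", accuracy);       // 0.9619, i.e. 96.19 %
    }
}
```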
importance of the input variables
04072023AAST-Comp eng95
Variable                       Chi-squared   Info Gain   Gain Ratio   Average Rank   IMPORTANCE
Clump Thickness                  378.08158       0.464        0.152     126.232526        8
Uniformity of Cell Size          539.79308       0.702        0.300     180.265026        1
Uniformity of Cell Shape         523.07097       0.677        0.272     174.673323        2
Marginal Adhesion                390.05950       0.464        0.210     130.244500        7
Single Epithelial Cell Size      447.86118       0.534        0.233     149.542726        5
Bare Nuclei                      489.00953       0.603        0.303     163.305176        3
Bland Chromatin                  453.20971       0.555        0.201     151.321903        4
Normal Nucleoli                  416.63061       0.487        0.237     139.118203        6
Mitoses                          191.96820       0.212        0.212      64.122733        9
04072023AAST-Comp eng96
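Rankings like the table above can be produced with Weka's attribute selection classes. A sketch using information gain (ChiSquaredAttributeEval and GainRatioAttributeEval can be swapped in the same way); the file name is the same hypothetical ARFF as before:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Rank every attribute by its information gain with respect to the class
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());  // attributes ordered by merit
    }
}
```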
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 • Using an updated version of Weka
 • Using another data mining tool
 • Using alternative algorithms and techniques
AAST-Comp eng
Notes on the paper
 • Spelling mistakes
 • No point of contact (e-mail)
 • Wrong percentage calculation
 • Copying from old papers
 • Charts not clear
 • No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
04072023
What Is Cancer Cancer is a term used for diseases in which
abnormal cells divide without control and are able to invade other tissues Cancer cells can spread to other parts of the body through the blood and lymph systems
Cancer is not just one disease but many diseases There are more than 100 different types of cancer
Most cancers are named for the organ or type of cell in which they start
There are two general types of cancer tumours namelybull benignbull malignant
4 AAST-Comp eng
Skin cancer
Breast cancerColon cancer
Lung cancer
Pancreatic cancer
Liver cancer
Bladder cancer
Prostate Cancer
Kidney cancerThyroid Cancer
Leukemia Cancer
Edometrial Cancer
Rectal Cancer
Non-Hodgkin LymphomaCervical cancer
Thyroid Cancer
Oral cancer
AAST-Comp eng 504072023
Breast Cancer
6
bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer
bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually
bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind
AAST-Comp eng04072023
History and Background
Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment
7AAST-Comp eng04072023
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Skin cancer
Breast cancerColon cancer
Lung cancer
Pancreatic cancer
Liver cancer
Bladder cancer
Prostate Cancer
Kidney cancerThyroid Cancer
Leukemia Cancer
Edometrial Cancer
Rectal Cancer
Non-Hodgkin LymphomaCervical cancer
Thyroid Cancer
Oral cancer
AAST-Comp eng 504072023
Breast Cancer
6
bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer
bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually
bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind
AAST-Comp eng04072023
History and Background
Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment
7AAST-Comp eng04072023
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classification
• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes
• These approaches normally use a training set where all objects are already associated with known class labels
• The classification algorithm learns from the training set and builds a model
• Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality is also judged by computing time (lower is better)
• The quality of a given model can be described by a confusion matrix
• The confusion matrix summarizes the predictive ability of the method on new entries
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class
• Thus the diagonal elements represent correctly classified instances
• The off-diagonal elements represent misclassified instances
AAST-Comp eng 3804072023
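A minimal sketch of building a confusion matrix, assuming scikit-learn; the class labels "benign"/"malignant" mirror the paper's classes but the vectors here are made-up. Note that scikit-learn's convention puts actual classes on rows and predictions on columns (the slide states the transpose); the diagonal/off-diagonal reading is the same either way.

```python
from sklearn.metrics import confusion_matrix

actual    = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign", "benign"]

# Diagonal entries = correctly classified, off-diagonal = misclassified.
cm = confusion_matrix(actual, predicted, labels=["benign", "malignant"])
print(cm)
```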
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Classification techniques include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine Classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
Tennis example
[Figure: scatter of Humidity vs. Temperature; points labelled "play tennis" vs. "do not play tennis"]
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c
• Some methods find a separating hyperplane, but not the optimal one
• Support Vector Machine (SVM) finds an optimal solution
 – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
 – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
This line represents the decision boundary: ax + by − c = 0 (Ch. 15)
45
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data
04072023 AAST-Comp eng 46
SVM – Support Vector Machines
[Figure: support vectors shown for a small margin vs. a large margin]
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane
• The decision function is fully specified by a subset of training samples, the support vectors
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method
[Figure: support vectors on the maximized margin vs. a narrower margin] (Sec. 15.1)
48
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
 Relatively new concept
 Nice generalization properties
 Hard to learn – learned in batch mode using quadratic programming techniques
 Using kernels, can learn very complex functions
04072023 AAST-Comp eng 51
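A minimal sketch of a maximum-margin linear SVM, assuming scikit-learn. The paper uses Weka's SMO implementation, so this is only an analogous example; the 2-D points are made up (think of the humidity/temperature scatter above).

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [6.0, 6.0], [7.0, 6.5], [6.5, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The decision function is specified by the support vectors alone.
print("support vectors:\n", clf.support_vectors_)
print("predictions for (2, 2) and (6, 7):", clf.predict([[2.0, 2.0], [6.0, 7.0]]))
```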
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
 Learning by analogy: "Tell me who your friends are and I'll tell you who you are"
 A new example is assigned to the most common class among the K examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set
 Select the K examples nearest to E in the training set
 Assign E to the most common class among its K nearest neighbors
[Figure: points labelled "Response" / "No response"; the new example is classified as Response]
04072023 AAST-Comp eng 54
Distance Between Neighbors
 Each example is represented with a set of numerical attributes
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as
  D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )
 Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2)
  Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
04072023 AAST-Comp eng 55
Instance-Based Learning
 No model is built: store all training examples
 Any processing is delayed until a new instance must be classified
[Figure: points labelled "Response" / "No response"; the new instance is classified as Response]
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors
Customer  Age  Income  No. of credit cards  Response
John      35   35K     3                    No
Rachel    22   50K     2                    Yes
Hannah    63   200K    1                    No
Tom       59   170K    1                    No
Nellie    25   40K     4                    Yes
David     37   50K     2                    ?
04072023 AAST-Comp eng 57
Customer  Age  Income (K)  No. of cards  Response  Distance from David
John      35   35          3             No        sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.16
Rachel    22   50          2             Yes       sqrt[(22-37)² + (50-50)² + (2-2)²] = 15
Hannah    63   200         1             No        sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom       59   170         1             No        sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Nellie    25   40          4             Yes       sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.74
David     37   50          2             → predicted Yes (3 nearest neighbors: Rachel, John, Nellie)
04072023 AAST-Comp eng 58
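A minimal sketch reproducing the 3-nearest-neighbor vote for David above, in plain Python (Euclidean distance over Age, Income in K, and number of cards, as in the table).

```python
from math import sqrt
from collections import Counter

training = [("John",   (35, 35, 3),  "No"),
            ("Rachel", (22, 50, 2),  "Yes"),
            ("Hannah", (63, 200, 1), "No"),
            ("Tom",    (59, 170, 1), "No"),
            ("Nellie", (25, 40, 4),  "Yes")]
david = (37, 50, 2)

def distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The 3 closest examples are Rachel (15), John (15.17) and Nellie (15.75).
nearest = sorted(training, key=lambda r: distance(r[1], david))[:3]
print([(name, round(distance(x, david), 2)) for name, x, _ in nearest])

# Majority vote among the 3 neighbors -> "Yes"
print(Counter(label for _, _, label in nearest).most_common(1)[0][0])
```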
Strengths and Weaknesses
Strengths:
 Simple to implement and use
 Comprehensible – easy to explain the prediction
 Robust to noisy data by averaging the k nearest neighbors
Weaknesses:
 Needs a lot of space to store all examples
 Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network she will receive a single lump sum, but if she signs with the movie company the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
 – Small box office: $200,000
 – Medium box office: $1,000,000
 – Large box office: $3,000,000
• TV network payout:
 – Flat rate: $900,000
• Probabilities:
 – P(Small Box Office) = 0.3
 – P(Medium Box Office) = 0.6
 – P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions                  States of Nature
                           Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior Probabilities        0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EVUII or EVBest
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
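A minimal sketch of the expected-value calculation behind this decision, in plain Python, using the payoffs and prior probabilities from the table above.

```python
probabilities = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

# Expected value of each decision = sum over states of P(state) * payoff(state).
expected = {d: sum(probabilities[s] * p[s] for s in probabilities)
            for d, p in payoffs.items()}
print(expected)                         # {'movie': 960000.0, 'tv': 900000.0}
print(max(expected, key=expected.get))  # 'movie' -> sign with the movie company
```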
Decision Trees
• Three types of "nodes":
 – Decision nodes: represented by squares (□)
 – Chance nodes: represented by circles (○)
 – Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
[Diagram: a decision node branching into Decision 1 and Decision 2; each decision leads to a chance node with Event 1, Event 2 and Event 3 branches]
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
[Diagram: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small / Medium / Large Box Office outcomes paying $200,000 / $1,000,000 / $3,000,000 for the movie and $900,000 in every case for TV]
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (with probabilities)
[Diagram: the same tree, with prior probabilities 0.3 / 0.6 / 0.1 attached to the Small / Medium / Large Box Office branches and an ER (expected return) placeholder at each chance node]
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
[Diagram: with probabilities 0.3 / 0.6 / 0.1, the chance node for the movie branch has ER = $960,000 and the TV branch has ER = $900,000; the best decision (ER = $960,000) is to sign with the movie company]
04072023 AAST-Comp eng 70
[Cycle diagram: Dataset → Data preprocessing → Feature selection → Selection of data mining tool → Classification → Performance evaluation → Results]
Evaluation Metrics
                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
 – Split the data into 10 equal-sized pieces
 – Train on 9 pieces and test on the remainder
 – Do this for all possibilities and average
04072023 AAST-Comp eng 73
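A minimal sketch of 10-fold cross-validation, assuming scikit-learn. The slide's 143/150 figure comes from a different (Weka) run, and `load_breast_cancer` is scikit-learn's bundled Wisconsin Diagnostic (WDBC) set rather than the 9-attribute dataset used in the paper, so the numbers here will not match exactly; the snippet only illustrates the splitting and averaging procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# 10-fold CV: split into 10 pieces, train on 9, test on the remainder, average.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```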
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race (white or black)
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (non-cancerous) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was experimented on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution (original data): Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so for the 683-instance dataset the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity Of Cell Size       1 - 10
Uniformity Of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant
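A minimal sketch of loading and cleaning this dataset, assuming pandas and a local copy of the UCI file (the file name is an assumption; the column order follows the attribute table above, and '?' marks the 16 missing Bare Nuclei values).

```python
import pandas as pd

columns = ["Sample Code Number", "Clump Thickness", "Uniformity of Cell Size",
           "Uniformity of Cell Shape", "Marginal Adhesion",
           "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
           "Normal Nucleoli", "Mitoses", "Class"]

df = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")
df = df.dropna()                   # 699 -> 683 instances after removing missing values
print(df["Class"].value_counts())  # expected: 2 (benign) = 444, 4 (malignant) = 239
```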
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
Importance of the input variables (value distribution per attribute)
04072023AAST-Comp eng90
Domain                        1     2    3    4    5    6    7    8    9   10    Sum
Clump Thickness               139   50   104  79   128  33   23   44   14  69    683
Uniformity of Cell Size       373   45   52   38   30   25   19   28   6   67    683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27   7   58    683
Marginal Adhesion             393   58   58   33   23   21   13   25   4   55    683
Single Epithelial Cell Size   44    376  71   48   39   40   11   21   2   31    683
Bare Nuclei                   402   30   28   19   30   4    8    21   9   132   683
Bland Chromatin               150   160  161  39   34   9    71   28   11  20    683
Normal Nucleoli               432   36   42   18   19   22   16   23   15  60    683
Mitoses                       563   35   33   12   6    3    9    8    0   14    683
Sum                           2843  850  605  333  346  192  207  233  77  516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                BF TREE   IBK     SMO
Time To Build Model (in sec)       0.97      0.02    0.33
Correctly Classified Instances     652       655     657
Incorrectly Classified Instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN)
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP)
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN)
 True positive (TP) = number of positive samples correctly predicted
 False negative (FN) = number of positive samples wrongly predicted
 False positive (FP) = number of negative samples wrongly predicted as positive
 True negative (TN) = number of negative samples correctly predicted
92 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   TP     FP     Precision  Recall  Class
BF Tree      0.971  0.075  0.96       0.971   Benign
             0.925  0.029  0.944      0.925   Malignant
IBK          0.98   0.079  0.958      0.98    Benign
             0.921  0.02   0.961      0.921   Malignant
SMO          0.971  0.054  0.971      0.971   Benign
             0.946  0.029  0.946      0.946   Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
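A minimal sketch recomputing accuracy, sensitivity and specificity from the SMO confusion matrix above (benign treated as the positive class), in plain Python.

```python
tp, fn = 431, 13     # benign correctly / wrongly predicted
fp, tn = 13, 226     # malignant wrongly / correctly predicted

accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# accuracy ~= 0.9619, matching the 96.19% reported for SMO above
```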
Importance of the input variables (attribute ranking)
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.21         130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
04072023AAST-Comp eng96
0407202397
CONCLUSION
 The accuracy of classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques
AAST-Comp eng
Notes on the paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD, "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kauffman Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA, 185.
04072023
04072023105
Thank you
AAST-Comp eng
Breast Cancer
6
bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer
bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually
bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind
AAST-Comp eng04072023
History and Background
Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment
7AAST-Comp eng04072023
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
Thank you
History and Background
Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment
7AAST-Comp eng04072023
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Breast Cancer Classification
8AAST-Comp eng
Round well-defined larger groups are more likely benign
Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant
Suspicious pixels groups show up as white spots on a mammogram
04072023
Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash
features that turned out to be cancer used for diagnosis prognosis of each cell nucleus
9AAST-Comp eng
F2Magnetic Resonance Image
F1
F3
Fn
Feature
Extraction
04072023
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the Logistic classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets, cardiotocography1 and cardiotocography2, as well as other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances (see the sketch below).
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
04072023AAST-Comp eng85
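The removal of the 16 records with missing values described above can be done programmatically; a minimal sketch using Weka's Instances API (the file name is assumed for illustration):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RemoveMissingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Before: " + data.numInstances()); // 699 expected

        // Delete every instance with a missing value in any attribute
        // (in this dataset only Bare Nuclei contains missing values).
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }
        System.out.println("After:  " + data.numInstances()); // 683 expected
    }
}
```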
04072023 AAST-Comp eng 86
Attribute                      Domain
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant
0407202387
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code (as sketched below).
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
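To make the comparison concrete, the three classifiers can be built and evaluated from Java roughly as follows. This is a sketch only: BFTree ships with Weka 3.6 (the version used here) but may need to be installed separately in later releases, and the ARFF file name is assumed:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiersSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
        String[] names = { "BF Tree", "IBK", "SMO" };

        for (int c = 0; c < classifiers.length; c++) {
            // Time to build the model on the full dataset.
            long start = System.currentTimeMillis();
            classifiers[c].buildClassifier(data);
            double buildSec = (System.currentTimeMillis() - start) / 1000.0;

            // 10-fold cross-validation for accuracy and the confusion matrix.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifiers[c], data, 10, new Random(1));

            System.out.printf("%s: build %.2f s, accuracy %.2f%%%n",
                    names[c], buildSec, eval.pctCorrect());
            System.out.println(eval.toMatrixString());
        }
    }
}
```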
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain                          1     2     3     4     5     6     7     8     9    10   Sum
Clump Thickness               139    50   104    79   128    33    23    44    14    69   683
Uniformity of Cell Size       373    45    52    38    30    25    19    28     6    67   683
Uniformity of Cell Shape      346    58    53    43    32    29    30    27     7    58   683
Marginal Adhesion             393    58    58    33    23    21    13    25     4    55   683
Single Epithelial Cell Size    44   376    71    48    39    40    11    21     2    31   683
Bare Nuclei                   402    30    28    19    30     4     8    21     9   132   683
Bland Chromatin               150   160   161    39    34     9    71    28    11    20   683
Normal Nucleoli               432    36    42    18    19    22    16    23    15    60   683
Mitoses                       563    35    33    12     6     3     9     8     0    14   683
Sum                          2843   850   605   333   346   192   207   233    77   516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                   BF Tree    IBK      SMO
Time to Build Model (sec)              0.97      0.02     0.33
Correctly Classified Instances         652       655      657
Incorrectly Classified Instances       31        28       26
Accuracy (%)                           95.46     95.90    96.19
EXPERIMENTAL RESULTS
• The sensitivity, or the true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or the true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree           431                  13               Benign
                   18                 221               Malignant
IBK               435                   9               Benign
                   19                 220               Malignant
SMO               431                  13               Benign
                   13                 226               Malignant
94 04072023AAST-Comp eng
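As a quick sanity check, the summary figures can be recomputed by hand from the SMO confusion matrix above, treating benign as the positive class:

```java
public class MetricsFromConfusionMatrix {
    public static void main(String[] args) {
        // SMO confusion matrix from the table above (benign = positive class).
        int tp = 431, fn = 13;   // actual benign:    431 predicted benign, 13 predicted malignant
        int fp = 13,  tn = 226;  // actual malignant:  13 predicted benign, 226 predicted malignant

        double sensitivity = (double) tp / (tp + fn);                  // TP / (TP + FN)
        double specificity = (double) tn / (tn + fp);                  // TN / (TN + FP)
        double accuracy    = (double) (tp + tn) / (tp + fp + tn + fn); // (TP + TN) / all

        System.out.printf("Sensitivity (TPR) = %.3f%n", sensitivity); // 0.971
        System.out.printf("Specificity (TNR) = %.3f%n", specificity); // 0.946
        System.out.printf("Accuracy          = %.4f%n", accuracy);    // 0.9619, i.e. 96.19%
    }
}
```

These values reproduce the reported SMO results (TP rate 0.971 for benign, 0.946 for malignant, overall accuracy 96.19%).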
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                378.08158      0.464       0.152       126.232526       8
Uniformity of Cell Size        539.79308      0.702       0.300       180.265026       1
Uniformity of Cell Shape       523.07097      0.677       0.272       174.673323       2
Marginal Adhesion              390.0595       0.464       0.210       130.2445         7
Single Epithelial Cell Size    447.86118      0.534       0.233       149.542726       5
Bare Nuclei                    489.00953      0.603       0.303       163.305176       3
Bland Chromatin                453.20971      0.555       0.201       151.321903       4
Normal Nucleoli                416.63061      0.487       0.237       139.118203       6
Mitoses                        191.9682       0.212       0.212        64.122733       9
04072023AAST-Comp eng96
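The chi-squared, information gain and gain ratio scores in the table above can be obtained with Weka's attribute evaluators; averaging the three scores per attribute yields the Average Rank column used to order the attributes. A minimal sketch, assuming the weka.attributeSelection class names of Weka 3.6 and a placeholder file name:

```java
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributesSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        ChiSquaredAttributeEval chi = new ChiSquaredAttributeEval();
        InfoGainAttributeEval ig = new InfoGainAttributeEval();
        GainRatioAttributeEval gr = new GainRatioAttributeEval();
        chi.buildEvaluator(data);
        ig.buildEvaluator(data);
        gr.buildEvaluator(data);

        // Score every predictive attribute and average the three measures.
        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;
            double avg = (chi.evaluateAttribute(i) + ig.evaluateAttribute(i)
                    + gr.evaluateAttribute(i)) / 3.0;
            System.out.printf("%-28s chi=%9.3f  ig=%.3f  gr=%.3f  avg=%9.3f%n",
                    data.attribute(i).name(), chi.evaluateAttribute(i),
                    ig.evaluateAttribute(i), gr.evaluateAttribute(i), avg);
        }
    }
}
```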
0407202397
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions
04072023AAST-Comp eng99
Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003:188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the international conference on engineering applications of neural networks, pp. 427–430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.
04072023105
Thank you
AAST-Comp eng
Diagnosis or prognosis
Brest CancerBenign
Malignant
AAST-Comp eng 1004072023
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
 Relatively new concept.
 Nice generalization properties.
 Hard to learn – learned in batch mode using quadratic programming techniques.
 Using kernels, it can learn very complex functions.
04072023 AAST-Comp eng 51
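As a concrete illustration of using an SVM in practice, the sketch below trains Weka's SMO implementation (the SVM used later in the paper) on the breast-cancer data. This is only a minimal sketch of the Weka API; the ARFF file name is a placeholder, not a file named in the slides:

```java
import weka.classifiers.functions.SMO;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainSvm {
    public static void main(String[] args) throws Exception {
        // Load the dataset (file name is a placeholder) and set the class attribute.
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // SMO is Weka's SVM trained by sequential minimal optimization (quadratic programming).
        SMO svm = new SMO();
        svm.buildClassifier(data);

        // Predict the class of the first instance (returns the index of the predicted class value).
        Instance first = data.instance(0);
        double predicted = svm.classifyInstance(first);
        System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
    }
}
```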
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
 Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
 A new example is assigned to the most common class among the (K) examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K examples nearest to E in the training set.
 Assign E to the most common class among its K nearest neighbors.
(Figure: neighboring points labeled "response" / "no response"; the new example is assigned class "Response".)
04072023 AAST-Comp eng 54
 Each example is represented with a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:
   D(X, Y) = sqrt[ (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² ]
 Example:
   John: Age = 35, Income = 95K, No. of credit cards = 3
   Rachel: Age = 41, Income = 215K, No. of credit cards = 2
   Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning
 No model is built.
 Store all training examples.
 Any processing is delayed until a new instance must be classified.
(Figure: stored examples labeled "response" / "no response"; the new instance is assigned class "Respond".)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors

Customer   Age   Income   No. of credit cards   Response
John       35    35K      3                     No
Rachel     22    50K      2                     Yes
Hannah     63    200K     1                     No
Tom        59    170K     1                     No
Nellie     25    40K      4                     Yes
David      37    50K      2                     ?
04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel     22    50           2           Yes        sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah     63    200          1           No         sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom        59    170          1           No         sqrt[(59−37)² + (170−50)² + (1−2)²] = 122.0
Nellie     25    40           4           Yes        sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
David      37    50           2           ?

The three nearest neighbors of David are Rachel (15), John (15.16) and Nellie (15.74); the majority class among them is Yes, so David's predicted response is Yes.
04072023 AAST-Comp eng 58
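To make the worked example above concrete, here is a small self-contained Java sketch (not Weka's IBk itself) that computes the Euclidean distances from David and takes a majority vote over the 3 nearest neighbors; the attribute values are the ones from the table:

```java
import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbors {
    // Euclidean distance between two examples represented as numeric attribute vectors.
    static double distance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        String[] names     = {"John", "Rachel", "Hannah", "Tom", "Nellie"};
        double[][] train   = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        boolean[] responds = {false, true, false, false, true};   // class labels from the table
        double[] david     = {37, 50, 2};                          // the new example to classify

        // Sort training examples by their distance from David.
        Integer[] idx = {0, 1, 2, 3, 4};
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], david)));

        // Majority vote over the 3 nearest neighbors.
        int yes = 0;
        for (int k = 0; k < 3; k++) {
            System.out.printf("%s: %.2f%n", names[idx[k]], distance(train[idx[k]], david));
            if (responds[idx[k]]) yes++;
        }
        System.out.println("Predicted response for David: " + (yes >= 2 ? "Yes" : "No"));
    }
}
```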
Strengths and Weaknesses
Strengths:
 Simple to implement and use.
 Comprehensible – easy to explain the prediction.
 Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
 Needs a lot of space to store all examples.
 Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind – Payoff Table

Decision                      Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company       $200,000           $1,000,000          $3,000,000
Sign with TV Network          $900,000           $900,000            $900,000
Prior probabilities           0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000  (= EV_UII, or EV_Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
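The expected-return arithmetic above is easy to check in code. A minimal sketch in plain Java, using the payoffs and probabilities from the example:

```java
public class ExpectedReturn {
    // Expected value = sum of (probability * payoff) over all states of nature.
    static double expectedValue(double[] probs, double[] payoffs) {
        double ev = 0;
        for (int i = 0; i < probs.length; i++) ev += probs[i] * payoffs[i];
        return ev;
    }

    public static void main(String[] args) {
        double[] probs = {0.3, 0.6, 0.1};   // small, medium, large box office
        double evMovie = expectedValue(probs, new double[]{200_000, 1_000_000, 3_000_000});
        double evTv    = expectedValue(probs, new double[]{900_000, 900_000, 900_000});

        System.out.printf("EV(movie) = $%.0f, EV(tv) = $%.0f%n", evMovie, evTv);  // 960000 vs 900000
        System.out.println("Best decision: "
                + (evMovie > evTv ? "sign with movie company" : "sign with TV network"));
    }
}
```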
Decision Trees
• Three types of "nodes":
  – Decision nodes – represented by squares
  – Chance nodes – represented by circles
  – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree (figure): a decision node (square) branches into Decision 1 and Decision 2; each decision leads to a chance node (circle) with branches for Event 1, Event 2 and Event 3.
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree (figure): "Sign with Movie Co." leads to a chance node with Small / Medium / Large Box Office branches paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node whose three branches each pay $900,000.
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (figure, with probabilities): the same tree annotated with branch probabilities 0.3 / 0.6 / 0.1 on the box-office outcomes and expected-return (ER) placeholders at the chance nodes.
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree – Solved (figure): with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000, so the best decision (ER = $960,000) is to sign with the movie company.
04072023 AAST-Comp eng 70
(Pipeline diagram: dataset → data preprocessing → feature selection → selection of the data mining tool → classification → results → performance evaluation cycle.)
Evaluation Metrics
                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                      fn
Actual not healthy   fp                      tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
04072023 AAST-Comp eng 73
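A 10-fold cross-validation run like this can be reproduced with the Weka API for the three classifiers compared later (SMO, IBk, BF Tree). This is a sketch only: the ARFF file name is a placeholder, and BFTree ships with the Weka 3.6 line used in the paper (in newer releases it is a separate package):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {new SMO(), new IBk(3), new BFTree()};
        for (Classifier c : classifiers) {
            // 10-fold cross-validation: split into 10 pieces, train on 9, test on 1, average.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getSimpleName());
            System.out.println(eval.toSummaryString());   // correctly / incorrectly classified counts
        }
    }
}
```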
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
  • Are usually not harmful
  • Rarely invade the tissues around them
  • Don't spread to other parts of the body
  • Can be removed and usually don't grow back
 Malignant tumors:
  • May be a threat to life
  • Can invade nearby organs and tissues (such as the chest wall)
  • Can spread to other parts of the body
  • Often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used Naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Obtained from the UC Irvine machine learning repository; the data are from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so these percentages are wrong; the right ones are Benign: 444 (65%) and Malignant: 239 (35%).
04072023AAST-Comp eng85
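One way to reproduce the 699 → 683 instance reduction is to drop every instance that has a missing attribute value. A minimal sketch using the Weka Instances API (the ARFF file name is a placeholder):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RemoveMissing {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Before: " + data.numInstances() + " instances");   // expected 699

        // Delete every instance that has a missing value for any attribute.
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }
        System.out.println("After:  " + data.numInstances() + " instances");   // expected 683
    }
}
```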
04072023 AAST-Comp eng 86
Attribute                      Domain
Sample Code Number             Id number
Clump Thickness                1 – 10
Uniformity of Cell Size        1 – 10
Uniformity of Cell Shape       1 – 10
Marginal Adhesion              1 – 10
Single Epithelial Cell Size    1 – 10
Bare Nuclei                    1 – 10
Bland Chromatin                1 – 10
Normal Nucleoli                1 – 10
Mitoses                        1 – 10
Class                          2 for benign, 4 for malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Number of instances taking each value (1–10) for every attribute:

Attribute                       1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size    44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                   402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin               150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15   60   683
Mitoses                       563   35   33   12    6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233   77  516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                  BF Tree   IBK     SMO
Time to build model (in sec)         0.97      0.02    0.33
Correctly classified instances       652       655     657
Incorrectly classified instances     31        28      26
Accuracy (%)                         95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
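These definitions translate directly into code. A minimal sketch that computes the three measures from raw confusion-matrix counts; the counts used here are SMO's figures from the confusion-matrix table below, with benign treated as the positive class:

```java
public class Metrics {
    public static void main(String[] args) {
        // SMO confusion-matrix counts with "benign" as the positive class.
        int tp = 431, fn = 13, fp = 13, tn = 226;

        double sensitivity = (double) tp / (tp + fn);            // true positive rate
        double specificity = (double) tn / (tn + fp);            // true negative rate
        double accuracy    = (double) (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("Sensitivity = %.4f%n", sensitivity);  // ~0.9707
        System.out.printf("Specificity = %.4f%n", specificity);  // ~0.9456
        System.out.printf("Accuracy    = %.4f%n", accuracy);     // ~0.9619
    }
}
```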
EXPERIMENTAL RESULTS

Classifier   Class       TP Rate   FP Rate   Precision   Recall
BF Tree      Benign      0.971     0.075     0.96        0.971
             Malignant   0.925     0.029     0.944       0.925
IBK          Benign      0.98      0.079     0.958       0.98
             Malignant   0.921     0.02      0.961       0.921
SMO          Benign      0.971     0.054     0.971       0.971
             Malignant   0.946     0.029     0.946       0.946
93 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices; rows are actual classes, columns are predicted classes)

Classifier   Predicted Benign   Predicted Malignant   Actual class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023 AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       19.19682      0.212       0.212        6.4122733    9
04072023AAST-Comp eng96
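The information-gain column of this ranking can be reproduced with Weka's attribute-selection classes. A minimal sketch under the same assumptions as before (Weka 3.6 class names, placeholder ARFF file name):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate each attribute by information gain with respect to the class and rank them all.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());   // ranked list of attributes with scores
    }
}
```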
0407202397
CONCLUSION
 The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is the highest compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Use an updated version of Weka.
 Use another data mining tool.
 Use alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (ISSN 2277–0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, pp. 188–193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Set", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), pp. 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA, p. 185.
04072023
04072023105
Thank you
AAST-Comp eng
04072023 AAST-Comp eng 11
Computer-Aided Diagnosis
bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage
bull Radiologists misdiagnose 10-30 of the malignant cases
bull Of the cases sent for surgical biopsy only 10-20 are actually malignant
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Computational Intelligence
Computational IntelligenceData + Knowledge
Artificial Intelligence
Expert systems
Fuzzylogic
PatternRecognition
Machinelearning
Probabilistic methods
Multivariatestatistics
Visuali-zation
Evolutionaryalgorithms
Neuralnetworks
04072023 AAST-Comp eng 12
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added and accessibility.
AAST-Comp eng 24 04072023
Preprocessing techniques
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  – Part of data reduction, of particular importance for numerical data
AAST-Comp eng 2504072023
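To make the cleaning and transformation steps concrete, here is a minimal illustrative sketch in Python (pandas and scikit-learn assumed; the two columns are hypothetical, loosely named after dataset attributes):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with a missing value and attributes on different scales (hypothetical columns).
df = pd.DataFrame({
    "clump_thickness": [5, 3, None, 8],
    "cell_size":       [1, 4, 2, 10],
})

# Data cleaning: fill the missing value with the column median.
df["clump_thickness"] = df["clump_thickness"].fillna(df["clump_thickness"].median())

# Data transformation: normalize every attribute to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(df)
print(scaled)
```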
Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data
AAST-Comp eng 27 04072023
Feature Selection
• Transforming a dataset by removing some of its columns:
  A1 A2 A3 A4 C  →  A2 A4 C
04072023 AAST-Comp eng 28
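A small sketch of the column-removal view of feature selection shown above (pandas assumed; A1–A4 and the class column C mirror the illustration):

```python
import pandas as pd

# Dataset with four attributes A1..A4 and a class column C (toy values).
df = pd.DataFrame({
    "A1": [1, 2, 3],
    "A2": [4, 5, 6],
    "A3": [7, 8, 9],
    "A4": [0, 1, 0],
    "C":  ["benign", "malignant", "benign"],
})

# Feature selection as a column transformation: keep only A2, A4 and the class.
selected = df[["A2", "A4", "C"]]
print(selected)
```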
Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set of known categories.
Classification (Recognition) (Supervised Classification): Category "A" vs. Category "B"
AAST-Comp eng 30 04072023
Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification vs. Prediction
Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction:
• models continuous-valued functions, i.e. predicts unknown or missing values
04072023 AAST-Comp eng 33
Classification — A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model; the test set is independent of the training set, otherwise over-fitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1): Model Construction
Training data:
NAME     RANK            YEARS  TENURED
Mike     Assistant Prof  3      no
Mary     Assistant Prof  7      yes
Bill     Professor       2      yes
Jim      Associate Prof  7      yes
Dave     Assistant Prof  6      no
Anne     Associate Prof  3      no
The classification algorithm learns the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
04072023 AAST-Comp eng 35
Classification Process (2): Use the Model in Prediction
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4) → Tenured?
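As an illustration of the two-step process, the following sketch hand-codes the rule induced on the previous slide (IF rank = 'professor' OR years > 6 THEN tenured = 'yes'), estimates its accuracy on the test set, and then classifies the unseen tuple (Jeff, Professor, 4). This is a toy example, not part of the paper:

```python
def learned_model(rank, years):
    """Classifier induced from the training set:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent test set.
test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(learned_model(rank, years) == label for _, rank, years, label in test)
print(f"accuracy on test set: {correct}/{len(test)}")   # 3/4: Merlisa is misclassified

# Step 2b: classify the unseen tuple (Jeff, Professor, 4).
print(learned_model("Professor", 4))                    # -> 'yes'
```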
Classification
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality is also evaluated with respect to the lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class.
• Thus the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK
40 04072023 AAST-Comp eng
Classification Model: Support Vector Machine Classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.
04072023 AAST-Comp eng 42
Tennis example
[Scatter plot of examples by Humidity and Temperature: one class = play tennis, the other = do not play tennis]
04072023 AAST-Comp eng 44
Linear classifiers: which hyperplane?
• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
45 04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.
04072023 AAST-Comp eng 46
SVM – Support Vector Machines
[Figure: support vectors shown for a small-margin vs. a large-margin separating hyperplane]
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: support vectors on the maximized margin vs. a narrower margin]
48 04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.
04072023 AAST-Comp eng 51
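A minimal illustrative sketch of a maximum-margin linear SVM (scikit-learn assumed; the toy points are made up and are not the tennis or breast cancer data):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of 2-D points (illustrative data only).
X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [5, 5], [6, 5], [5, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM maximizes the margin between the separating hyperplane
# and the closest training points (the support vectors).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]
print("support vectors:", clf.support_vectors_)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
print("prediction for (2, 2):", clf.predict([[2, 2]]))
```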
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
[Figure: points labelled 'Response' / 'No response'; the new point is assigned the class 'Response']
04072023 AAST-Comp eng 54
Distance Between Neighbors
Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:
D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )
John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2
Distance(John, Rachel) = sqrt[ (35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2 ]
04072023 AAST-Comp eng 55
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Figure: points labelled 'Response' / 'No response'; the new point is assigned the class 'Response']
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors
Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?
04072023 AAST-Comp eng 57
Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] ≈ 15.16
Rachel    22   50          2          Yes       sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
Hannah    63   200         1          No        sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] ≈ 152.23
Tom       59   170         1          No        sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] ≈ 122
Nellie    25   40          4          Yes       sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] ≈ 15.74
David's three nearest neighbors are Rachel, John and Nellie, so David is classified as: Yes
04072023 AAST-Comp eng 58
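The distance computations above can be reproduced with a few lines of Python (values copied from the 3-nearest-neighbour table; income in thousands):

```python
from math import sqrt
from collections import Counter

# Training examples from the slide: (age, income in K, cards) -> response
train = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance from David to every stored example (instance-based learning:
# no model is built, all work happens at classification time).
dists = sorted((euclidean(x, david), label) for x, label in train.values())
print(dists)                       # nearest three: Rachel, John, Nellie

# Majority vote among the 3 nearest neighbours.
votes = Counter(label for _, label in dists[:3])
print(votes.most_common(1)[0][0])  # -> 'Yes'
```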
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind – Payoff Table
Decision                  Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior probabilities       0.3               0.6                0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
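The expected-return computation can also be written out in a few lines of Python (figures taken from the payoff table above):

```python
# Probabilities of the three states of nature.
p = {"small": 0.3, "medium": 0.6, "large": 0.1}

# Payoffs for each decision under each state of nature (in dollars).
payoff = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

# Expected value of each decision = sum of probability-weighted payoffs.
ev = {d: sum(p[s] * v for s, v in states.items()) for d, states in payoff.items()}
print(ev)                      # movie: 960,000 and tv: 900,000 (up to float rounding)
print(max(ev, key=ev.get))     # -> 'movie'
```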
Decision Trees
• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
[Diagram: a decision node (Decision 1 / Decision 2) leading to a chance node with branches Event 1, Event 2, Event 3]
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
[Diagram: decision node with branches 'Sign with Movie Co.' and 'Sign with TV Network'; each leads to a chance node with branches Small / Medium / Large Box Office and payoffs $200,000 / $1,000,000 / $3,000,000 and $900,000 / $900,000 / $900,000]
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (with probabilities)
[Same diagram with branch probabilities 0.3 / 0.6 / 0.1 on each chance node; the expected return (ER) of each chance node is to be computed]
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree – Solved
[Same diagram solved from right to left: the movie branch has ER = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 and the TV branch has ER = $900,000, so the movie contract is chosen (ER = $960,000)]
04072023 AAST-Comp eng 70
Evaluation Metrics
                      Predicted as healthy  Predicted as unhealthy
Actual healthy        tp                    fn
Actual not healthy    fp                    tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly classified instances: 143 (95.33 %)
• Incorrectly classified instances: 7 (4.67 %)
• Default: 10-fold cross-validation, i.e.
  – split the data into 10 equal-sized pieces,
  – train on 9 pieces and test on the remainder,
  – do this for all possibilities and average.
04072023 AAST-Comp eng 73
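For illustration, a 10-fold cross-validation run in scikit-learn looks like the sketch below. It uses scikit-learn's built-in breast cancer data, which is the diagnostic Wisconsin set rather than the original file analysed in the paper, so the resulting accuracy will differ from the figures quoted above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 10-fold cross-validation: split the data into 10 equal parts, train on 9,
# test on the remaining one, repeat for every fold and average the accuracies.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```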
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has a higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• are usually not harmful
• rarely invade the tissues around them
• don't spread to other parts of the body
• can be removed and usually don't grow back
 Malignant tumors:
• may be a threat to life
• can invade nearby organs and tissues (such as the chest wall)
• can spread to other parts of the body
• often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
 Vikas Chaurasia et al. used RepTree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5.0 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used naïve Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90 % accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years, and one for those patients who died before 5 years.
 Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the Logistics classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into the predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: benign 458 (65.5 %), malignant 241 (34.5 %).
 Note: since 2 malignant and 14 benign instances were excluded, the distribution of the 683-instance set is benign 444 (65 %) and malignant 239 (35 %).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute – Domain
Sample Code Number – Id number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
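A hedged sketch of how the 683-instance working set can be derived from the original 699-instance UCI file (pandas assumed; it presumes breast-cancer-wisconsin.data has been downloaded locally, follows the attribute order above, and codes missing Bare Nuclei values as '?'):

```python
import pandas as pd

cols = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
        "marginal_adhesion", "epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

# Missing Bare Nuclei values are recorded as '?' in the UCI file.
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")

# Drop the 16 instances with missing values: 699 -> 683 rows.
df = df.dropna()

# The class is coded 2 = benign, 4 = malignant.
print(len(df), df["class"].value_counts().to_dict())   # 683 {2: 444, 4: 239}
```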
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023 AAST-Comp eng
Evaluation criteria                 BF Tree   IBK     SMO
Time to build model (in sec)        0.97      0.02    0.33
Correctly classified instances      652       655     657
Incorrectly classified instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.960      0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.980  0.079  0.958      0.980   Benign
            0.921  0.020  0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant
93 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier  Classified as Benign  Classified as Malignant  Actual class
BF Tree     431                   13                       Benign
            18                    221                      Malignant
IBK         435                   9                        Benign
            19                    220                      Malignant
SMO         431                   13                       Benign
            13                    226                      Malignant
94 04072023AAST-Comp eng
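Taking the SMO confusion matrix above as an example (benign treated as the positive class), the reported figures can be reproduced directly from the definitions; a small sketch:

```python
# SMO confusion matrix from the paper: rows = actual class, columns = predicted class.
tp, fn = 431, 13    # actual benign: correctly / wrongly classified
fp, tn = 13, 226    # actual malignant: wrongly / correctly classified

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy    = (tp + tn) / (tp + fp + tn + fn)

print(f"sensitivity = {sensitivity:.3f}")  # 0.971
print(f"specificity = {specificity:.3f}")  # 0.946
print(f"accuracy    = {accuracy:.4f}")     # 0.9619, i.e. 96.19 %
```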
importance of the input variables
04072023AAST-Comp eng95
Variable                     Chi-squared  Info Gain  Gain Ratio  Average rank  Importance
Clump Thickness              378.08158    0.464      0.152       126.232526    8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
Marginal Adhesion            390.0595     0.464      0.210       130.2445      7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.321903    4
Normal Nucleoli              416.63061    0.487      0.237       139.118203    6
Mitoses                      191.9682     0.212      0.212       64.122733     9
04072023AAST-Comp eng96
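A hedged sketch of how comparable per-attribute rankings can be produced outside Weka (scikit-learn assumed; scikit-learn's chi2 and mutual_info_classif are not identical to Weka's attribute evaluators, so the exact scores will differ from the table above):

```python
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

cols = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
        "marginal_adhesion", "epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

# Rebuild the cleaned 683-instance frame (same assumptions as the loading sketch).
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?").dropna()
X = df.drop(columns=["id", "class"])
y = df["class"]

chi2_scores, _ = chi2(X, y)                    # chi-squared statistic per attribute
ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)  # information-gain-like score

# Rank the attributes by chi-squared score, highest first.
for name, c, g in sorted(zip(X.columns, chi2_scores, ig), key=lambda t: -t[1]):
    print(f"{name:25s} chi2={c:9.2f} mutual_info={g:.3f}")
```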
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka.
 Using another data mining tool.
 Using alternative algorithms and techniques.
AAST-Comp eng
Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277–0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and made a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–870.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
What do these methods do
bull Provide non-parametric models of databull Allow to classify new data to pre-defined
categories supporting diagnosis amp prognosis
bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy
or crisp logical rulesbull Help to visualize multi-dimensional
relationships among data samples 04072023 AAST-Comp eng 13
14
Feature selection
Data Preprocessing
Selecting Data mining tool dataset
Classification algorithm
SMO IBK BF TREE
Results and evaluationsAAST-Comp eng
Pattern recognition system decomposition
04072023
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
Results
Data preprocessing
Feature selectionClassification
Selection tool data mining
Performance evaluation Cycle
Dataset
data sets
AAST-Comp eng 1604072023
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Taken from the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospital, Madison, collected by Dr W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so those percentages no longer hold; the correct distribution for the 683-instance set is benign 444 (65%) and malignant 239 (35%).
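As a quick check on the corrected distribution: 458 - 14 = 444 benign and 241 - 2 = 239 malignant instances remain, and 444 / 683 is about 65% while 239 / 683 is about 35%, matching the note above.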
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                     Domain
Sample Code Number            Id number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant
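To make the preprocessing step concrete, here is a minimal Java sketch, assuming the dataset has been exported to a local ARFF file (the file name and the use of the Weka 3.6.x API are assumptions), that loads the 699 instances and drops the ones with missing values to obtain the 683-instance set described above:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: load the breast-cancer-wisconsin data and drop rows with
// missing values, reproducing the 699 -> 683 instance reduction.
public class LoadWbcData {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);       // last attribute = Class (2 / 4)
        System.out.println("Before cleaning: " + data.numInstances() + " instances"); // 699
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);                      // remove instances missing attribute i
        }
        System.out.println("After cleaning:  " + data.numInstances() + " instances"); // 683
    }
}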
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
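As an illustration of how such a comparison can be run from Java, the sketch below cross-validates the three classifiers with 10 folds. The class names are those shipped with Weka 3.6.x (BFTree was later moved to a separate package), default classifier settings are used, and the input file name is an assumption; this is a sketch, not necessarily the exact procedure used in the paper.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of the three classifiers compared in this work.
public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin-683.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] models = { new BFTree(), new IBk(), new SMO() };  // default settings
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-8s accuracy = %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}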
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
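To make these definitions concrete, the short sketch below computes the three rates from the SMO confusion matrix shown on the next slide, taking malignant as the positive class (the mapping of the counts to TP/FN/FP/TN is our reading of that matrix):

// Sketch: sensitivity, specificity and accuracy from a 2x2 confusion matrix.
public class ConfusionMetrics {
    public static void main(String[] args) {
        int tp = 226;   // malignant correctly predicted as malignant
        int fn = 13;    // malignant wrongly predicted as benign
        int fp = 13;    // benign wrongly predicted as malignant
        int tn = 431;   // benign correctly predicted as benign
        double sensitivity = tp / (double) (tp + fn);                 // 226/239, about 0.946
        double specificity = tn / (double) (tn + fp);                 // 431/444, about 0.971
        double accuracy = (tp + tn) / (double) (tp + tn + fp + fn);   // 657/683, about 0.9619
        System.out.printf("TPR=%.3f  TNR=%.3f  ACC=%.4f%n", sensitivity, specificity, accuracy);
    }
}

These values agree with the SMO recall figures (0.946 for malignant, 0.971 for benign) and the 96.19% accuracy reported above.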
EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
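As a consistency check, the accuracies in the earlier table follow directly from these confusion matrices: BF Tree (431 + 221) / 683 = 652 / 683, about 95.46%; IBK (435 + 220) / 683 = 655 / 683, about 95.90%; and SMO (431 + 226) / 683 = 657 / 683, about 96.19%.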
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.21         130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
04072023AAST-Comp eng96
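A ranking like the one above can be produced with Weka's attribute evaluators. The sketch below (Weka 3.6.x class names; the input file name is an assumption, and the Sample Code Number column is assumed to have been removed beforehand) ranks the attributes by chi-squared and information gain; GainRatioAttributeEval can be used the same way for the third column, and averaging the scores into a single rank is a small extra step.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: rank the nine attributes with two of Weka's attribute evaluators.
public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin-683.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection chi = new AttributeSelection();
        chi.setEvaluator(new ChiSquaredAttributeEval());
        chi.setSearch(new Ranker());
        chi.SelectAttributes(data);
        System.out.println("Chi-squared ranking:\n" + chi.toResultsString());

        AttributeSelection ig = new AttributeSelection();
        ig.setEvaluator(new InfoGainAttributeEval());
        ig.setSearch(new Ranker());
        ig.SelectAttributes(data);
        System.out.println("Information gain ranking:\n" + ig.toResultsString());
    }
}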
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is the highest compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 • Using an updated version of Weka
 • Using another data mining tool
 • Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
 • Spelling mistakes
 • No point of contact (e-mail)
 • Wrong percentage calculation
 • Copying from old papers
 • Charts not clear
 • No contributions
04072023AAST-Comp eng99
comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and performed a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group, United States Cancer Statistics 1999-2008 Incidence and Mortality Web-based Report, Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC, World Cancer Report, International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S Aruna, Dr S P Rajagopalan and L V Nandakishore (2011), Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y, Dr Sivaprakasam (2011), An Empirical Comparison of Data Mining Classification Methods, International Journal of Computer Information Systems, Vol 3, No 2, 2011.
[4] D Lavanya, Dr K Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E Osuna, R Freund and F Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P Ambulgekar (2009), Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets, 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol 2, no 1, pp 17-24, Feb 2012.
[8] B Ster and A Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp 427-430, 1996.
04072023
[9] T Joachims, Transductive inference for text classification using support vector machines, Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J Abonyi and F Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol 14(24), 2195-2207, 2003.
[11] Frank A & Asuncion A (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H Wolberg MD, W Nick Street PhD, Dennis M Heisey PhD, Olvi L Mangasarian PhD, Computerized breast cancer diagnosis and prognosis from fine needle aspirates, Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN, Wolberg WH, Mangasarian OL, Nuclear feature extraction for breast tumor diagnosis, Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861-70.
[14] Chen Y, Abraham A, Yang B (2006), Feature Selection and Classification using Flexible Neural Tree, Journal of Neurocomputing, 70(1-3), 305-313.
[15] J Han and M Kamber, "Data Mining: Concepts and Techniques", Morgan Kauffman Publishers, 2000.
[16] Bishop CM, "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik VN, The Nature of Statistical Learning Theory, 1st ed, Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 185.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
results
Data preprocessing
Feature selectionclassification
Selection tool datamining
Performance evaluation Cycle
Dataset
AAST-Comp eng 18
Data Mining
bull Data Mining is set of techniques used in various domains to give meaning to the available data
bull Objective Fit data to a modelndashDescriptivendashPredictive
04072023
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table

                            States of Nature
Decisions                   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior Probabilities         0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
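A short Java sketch of this expected-return computation (illustrative only; the payoffs and probabilities are the ones given above):

// Expected value of each decision under the given box-office probabilities.
public class ExpectedReturn {

    static double expectedValue(double[] payoffs, double[] probabilities) {
        double ev = 0.0;
        for (int i = 0; i < payoffs.length; i++) {
            ev += probabilities[i] * payoffs[i];
        }
        return ev;
    }

    public static void main(String[] args) {
        double[] p     = {0.3, 0.6, 0.1};                     // small, medium, large box office
        double[] movie = {200_000, 1_000_000, 3_000_000};     // movie company payouts
        double[] tv    = {900_000, 900_000, 900_000};         // flat TV network payout

        System.out.println("EV(movie) = " + expectedValue(movie, p));  // 960000.0
        System.out.println("EV(tv)    = " + expectedValue(tv, p));     // 900000.0
    }
}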
Decision Trees
Three types of "nodes":
 Decision nodes - represented by squares
 Chance nodes - represented by circles
 Terminal nodes - represented by triangles (optional)
Solving the tree involves pruning all but the best decisions at decision nodes and finding the expected values of all possible states of nature at chance nodes.
Create the tree from left to right; solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
(Figure: a decision node branching into Decision 1 and Decision 2; each decision leads to a chance node with branches Event 1, Event 2 and Event 3.)
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
(Figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with Small, Medium and Large Box Office outcomes, with payoffs $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 on every outcome of the TV branch.)
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (with probabilities)
(Figure: the same tree with branch probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large Box Office outcomes; the expected return (ER) at each chance node is still to be computed.)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
(Figure: the same tree with the expected returns filled in: ER = $960,000 for signing with the movie company and ER = $900,000 for signing with the TV network, so the movie branch is kept at the decision node.)
04072023 AAST-Comp eng 70
(Figure: the data mining cycle - dataset, data preprocessing, feature selection, classification with the selected data mining tool, and performance evaluation leading to the results.)
Evaluation Metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy        tp                     fn
Actual not healthy    fp                     tn
04072023 AAST-Comp eng 72
Cross-validation
 Correctly Classified Instances: 143 (95.3%)
 Incorrectly Classified Instances: 7 (4.67%)
 Default: 10-fold cross-validation, i.e.:
  Split the data into 10 equal-sized pieces
  Train on 9 pieces and test on the remainder
  Do this for all possibilities and average
04072023 AAST-Comp eng 73
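A minimal Java sketch of the 10-fold bookkeeping described above (illustrative only, not Weka's implementation; the per-fold accuracy used here is a placeholder value, since no real classifier is trained):

// Split n instances into 10 folds, use each fold once as the test set, and average the accuracies.
public class TenFoldCrossValidation {
    public static void main(String[] args) {
        int n = 683;       // instances in the cleaned breast-cancer-wisconsin dataset
        int folds = 10;

        double accuracySum = 0.0;
        for (int f = 0; f < folds; f++) {
            int testStart = f * n / folds;
            int testEnd   = (f + 1) * n / folds;   // exclusive
            int testSize  = testEnd - testStart;
            int trainSize = n - testSize;

            // In a real run: train on the instances outside [testStart, testEnd)
            // and evaluate on the test fold. A placeholder accuracy stands in here.
            double foldAccuracy = 0.95;            // placeholder value, illustration only
            accuracySum += foldAccuracy;

            System.out.printf("fold %d: train=%d test=%d%n", f + 1, trainSize, testSize);
        }
        System.out.println("mean accuracy = " + accuracySum / folds);
    }
}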
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) shows higher prediction accuracy than the IBK and BF Tree methods.
04072023 AAST-Comp eng 75
0407202376
Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
  Are usually not harmful
  Rarely invade the tissues around them
  Don't spread to other parts of the body
  Can be removed and usually don't grow back
 Malignant tumors:
  May be a threat to life
  Can invade nearby organs and tissues (such as the chest wall)
  Can spread to other parts of the body
  Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: women with denser breast tissue have a higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Taken from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 The breast-cancer-wisconsin dataset has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: since 2 malignant and 14 benign instances were excluded, these percentages are wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                      Domain
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for Benign, 4 for Malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
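A sketch of how such an experiment can be driven from Java with the Weka API (the ARFF file name is an assumption; weka.classifiers.lazy.IBk or weka.classifiers.trees.BFTree can be substituted for SMO to reproduce the other two runs):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the dataset, run SMO with 10-fold cross-validation, and print the summary and confusion matrix.
public class WekaSmoExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class (benign/malignant) is the last attribute

        SMO smo = new SMO();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());      // correctly/incorrectly classified instances, accuracy
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}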
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
Importance of the input variables: distribution of attribute values
04072023 AAST-Comp eng 90

Attribute                      1     2     3     4    5    6    7   8   9   10   Sum
Clump Thickness                139   50    104   79   128  33   23  44  14  69   683
Uniformity of Cell Size        373   45    52    38   30   25   19  28  6   67   683
Uniformity of Cell Shape       346   58    53    43   32   29   30  27  7   58   683
Marginal Adhesion              393   58    58    33   23   21   13  25  4   55   683
Single Epithelial Cell Size    44    376   71    48   39   40   11  21  2   31   683
Bare Nuclei                    402   30    28    19   30   4    8   21  9   132  683
Bland Chromatin                150   160   161   39   34   9    71  28  11  20   683
Normal Nucleoli                432   36    42    18   19   22   16  23  15  60   683
Mitoses                        563   35    33    12   6    3    9   8   0   14   683
Sum                            2843  850   605   333  346  192  207 233 77  516
EXPERIMENTAL RESULTS
91 04072023 AAST-Comp eng

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (in sec)       0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined as TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined as TN / (TN + FP).
 The accuracy is defined as (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023 AAST-Comp eng
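As a check (an illustrative sketch, treating Benign as the positive class), the metrics defined above can be recomputed in Java from the SMO confusion matrix:

// Recompute sensitivity, specificity and accuracy from the SMO confusion matrix:
// TP = 431, FN = 13, FP = 13, TN = 226.
public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        double tp = 431, fn = 13, fp = 13, tn = 226;

        double sensitivity = tp / (tp + fn);                  // true positive rate
        double specificity = tn / (tn + fp);                  // true negative rate
        double accuracy    = (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("sensitivity = %.4f%n", sensitivity);  // ~0.9707
        System.out.printf("specificity = %.4f%n", specificity);  // ~0.9456
        System.out.printf("accuracy    = %.4f%n", accuracy);     // ~0.9619, i.e. 96.19%
    }
}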
Importance of the input variables
04072023 AAST-Comp eng 95

Variable                      Chi-squared   Info Gain   Gain Ratio   Average       Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526    8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445      7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726    5
Bare Nuclei                   489.00953     0.603       0.303        163.305176    3
Bland Chromatin               453.20971     0.555       0.201        151.321903    4
Normal Nucleoli               416.63061     0.487       0.237        139.118203    6
Mitoses                       191.9682      0.212       0.212        64.122733     9
04072023AAST-Comp eng96
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is the highest compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
Predictive amp descriptive data mining
bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples
bull Descriptive Is to describe the general or special features of a set of data in a concise manner
AAST-Comp eng 1904072023
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
Tennis example
(Scatter plot of examples on Humidity and Temperature axes, with points labelled "play tennis" and "do not play tennis".)
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
This line represents the decision boundary: ax + by − c = 0
(Ch. 15)
45
04072023 AAST-Comp eng
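As a small illustration of the decision boundary ax + by − c = 0 above, the sketch below (with made-up coefficients a, b, c) classifies a 2-D point by the sign of ax + by − c; any separating hyperplane found by a linear classifier is used in exactly this way.

    public class LinearBoundary {
        // returns +1 for points on one side of the line ax + by - c = 0, -1 for the other
        static int classify(double a, double b, double c, double x, double y) {
            return (a * x + b * y - c >= 0) ? +1 : -1;
        }
        public static void main(String[] args) {
            double a = 1.0, b = 2.0, c = 3.0;                 // hypothetical hyperplane
            System.out.println(classify(a, b, c, 2.0, 1.5));  // 1*2 + 2*1.5 - 3 = 2 -> prints 1
        }
    }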
Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
(Illustration: support vectors, comparing a small-margin and a large-margin separating hyperplane.)
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
(Illustration: support vectors and the maximized margin, contrasted with a narrower margin. Sec. 15.1)
48
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
 Relatively new concept.
 Nice generalization properties.
 Hard to learn – learned in batch mode using quadratic programming techniques.
 Using kernels, it can learn very complex functions.
04072023 AAST-Comp eng 51
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the (K) examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K nearest examples to E in the training set.
 Assign E to the most common class among its K nearest neighbors.
(Illustration: training points labelled "Response" and "No response"; the new example falls among "Response" neighbours, so its class is Response.)
04072023 AAST-Comp eng 54
Distance Between Neighbors
 Each example is represented with a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:
 D(X, Y) = sqrt [ (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 ]
 John: Age = 35, Income = 95K, No. of credit cards = 3
 Rachel: Age = 41, Income = 215K, No. of credit cards = 2
 Distance(John, Rachel) = sqrt [ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
04072023 AAST-Comp eng 55
Instance Based Learning
 No model is built: store all training examples.
 Any processing is delayed until a new instance must be classified.
(Same illustration: the stored examples labelled "Response" / "No response"; the new instance is classified as Response.)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors

Customer   Age   Income   No. of credit cards   Response
John       35    35K      3                     No
Rachel     22    50K      2                     Yes
Hannah     63    200K     1                     No
Tom        59    170K     1                     No
Nellie     25    40K      4                     Yes
David      37    50K      2                     ?
04072023 AAST-Comp eng 57

Distance from David (income in thousands):
 John:   sqrt [ (35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2 ] = 15.16
 Rachel: sqrt [ (22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2 ] = 15
 Hannah: sqrt [ (63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2 ] = 152.23
 Tom:    sqrt [ (59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2 ] = 122.0
 Nellie: sqrt [ (25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2 ] = 15.74
The three nearest neighbours are Rachel (Yes), John (No) and Nellie (Yes), so David's predicted response is Yes.
04072023 AAST-Comp eng 58
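The same 3-NN prediction can be written down directly. The sketch below (illustrative class and method names) reuses the customer table and Euclidean distance from the slides, with incomes in thousands, and takes a majority vote over David's three nearest neighbours.

    import java.util.*;

    public class ThreeNN {
        record Customer(String name, double age, double income, double cards, String response) {}

        static double distance(Customer a, Customer b) {
            return Math.sqrt(Math.pow(a.age() - b.age(), 2)
                           + Math.pow(a.income() - b.income(), 2)
                           + Math.pow(a.cards() - b.cards(), 2));
        }

        public static void main(String[] args) {
            List<Customer> train = List.of(
                new Customer("John", 35, 35, 3, "No"),
                new Customer("Rachel", 22, 50, 2, "Yes"),
                new Customer("Hannah", 63, 200, 1, "No"),
                new Customer("Tom", 59, 170, 1, "No"),
                new Customer("Nellie", 25, 40, 4, "Yes"));
            Customer david = new Customer("David", 37, 50, 2, "?");

            // sort stored examples by distance to David, then vote over the 3 nearest
            List<Customer> sorted = new ArrayList<>(train);
            sorted.sort(Comparator.comparingDouble(c -> distance(c, david)));
            Map<String, Integer> votes = new HashMap<>();
            for (Customer c : sorted.subList(0, 3))
                votes.merge(c.response(), 1, Integer::sum);
            String predicted = votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue()).get().getKey();
            System.out.println("Votes: " + votes + " -> predicted response for David: " + predicted);
        }
    }

Running this reproduces the slide: Rachel, John and Nellie are the nearest neighbours, and the majority vote predicts "Yes".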
Strengths and Weaknesses
Strengths:
 Simple to implement and use.
 Comprehensible – easy to explain the prediction.
 Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
 Needs a lot of space to store all examples.
 Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
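For instance, the tenure rule from the earlier slides (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') can be read as such a set of sentences. A minimal sketch of that tree written as nested tests (the class name is illustrative):

    public class TenureTree {
        static String tenured(String rank, int years) {
            if (rank.equals("professor")) return "yes";   // first split: on rank
            else if (years > 6) return "yes";             // second split: on years
            else return "no";                             // leaf: not tenured
        }
        public static void main(String[] args) {
            System.out.println(tenured("assistant prof", 7)); // yes
            System.out.println(tenured("assistant prof", 3)); // no
        }
    }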
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table

                              States of Nature
Decisions                     Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company       $200,000           $1,000,000          $3,000,000
Sign with TV Network          $900,000           $900,000            $900,000
Prior probabilities           0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EVUII or EVBest
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
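A small sketch of the expected-return computation above, using the payoffs and prior probabilities from the payoff table (the class and method names are illustrative):

    public class ExpectedReturn {
        // expected value = sum of probability * payoff over all states of nature
        static double expectedValue(double[] probs, double[] payoffs) {
            double ev = 0;
            for (int i = 0; i < probs.length; i++) ev += probs[i] * payoffs[i];
            return ev;
        }
        public static void main(String[] args) {
            double[] probs = {0.3, 0.6, 0.1};   // small, medium, large box office
            double evMovie = expectedValue(probs, new double[]{200_000, 1_000_000, 3_000_000});
            double evTv    = expectedValue(probs, new double[]{900_000, 900_000, 900_000});
            System.out.println("EV(movie) = " + evMovie + ", EV(tv) = " + evTv);
            // prints EV(movie) = 960000.0, EV(tv) = 900000.0 -> sign with the movie company
        }
    }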
Decision Trees
• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles (O)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
(Diagram: a decision node (square) branches into Decision 1 and Decision 2; each decision leads to a chance node (circle) with branches Event 1, Event 2 and Event 3.)
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
(Tree: "Sign with Movie Co." leads to a chance node with branches Small / Medium / Large Box Office paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node paying $900,000 on every branch.)
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
(The same tree with prior probabilities 0.3 / 0.6 / 0.1 attached to the Small / Medium / Large Box Office branches; the expected return (ER) of each chance node is still to be computed.)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
(Solved tree: the movie-contract chance node has ER = $960,000 and the TV-network node has ER = $900,000, so the best decision at the decision node has ER = $960,000: sign with the movie company.)
04072023 AAST-Comp eng 70
Results
(Methodology cycle: Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Performance evaluation → Results.)
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
04072023 AAST-Comp eng 73
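A minimal sketch of that 10-fold loop; evaluateFold is a hypothetical placeholder standing in for training a classifier on the nine training pieces and measuring accuracy on the held-out piece.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class CrossValidationSketch {
        static double evaluateFold(List<Integer> trainIdx, List<Integer> testIdx) {
            return 0.95; // placeholder accuracy; a real run would fit and test a classifier here
        }
        public static void main(String[] args) {
            int n = 150, folds = 10;
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < n; i++) indices.add(i);
            Collections.shuffle(indices, new Random(1));        // random fold assignment
            double sum = 0;
            for (int f = 0; f < folds; f++) {
                List<Integer> test = new ArrayList<>(), train = new ArrayList<>();
                for (int i = 0; i < n; i++)
                    (i % folds == f ? test : train).add(indices.get(i)); // 1 piece test, 9 train
                sum += evaluateFold(train, test);
            }
            System.out.println("Average accuracy over " + folds + " folds: " + sum / folds);
        }
    }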
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analyse the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Source: the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute Domain
Sample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
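Putting the pieces together, here is a minimal Weka sketch of the evaluation described in this paper: 10-fold cross-validation of SMO, IBk and BF Tree on the breast-cancer-wisconsin dataset. The ARFF file path is an assumption, and BFTree is assumed to be available under weka.classifiers.trees as in Weka 3.6.x.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.BFTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // hypothetical path
            data.setClassIndex(data.numAttributes() - 1);                     // class: benign/malignant
            Classifier[] classifiers = {new SMO(), new IBk(), new BFTree()};
            for (Classifier c : classifiers) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));          // 10-fold CV
                System.out.println(c.getClass().getSimpleName()
                        + " accuracy: " + eval.pctCorrect() + "%");
            }
        }
    }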
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                  BF Tree   IBK     SMO
Timing to build model (in sec)       0.97      0.02    0.33
Correctly classified instances       652       655     657
Incorrectly classified instances     31        28      26
Accuracy (%)                         95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
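From the SMO confusion matrix above, taking malignant as the positive class (TP = 226, FN = 13, FP = 13, TN = 431), the metrics defined earlier can be checked directly. A small sketch:

    public class ConfusionMetrics {
        public static void main(String[] args) {
            double tp = 226, fn = 13, fp = 13, tn = 431;
            double sensitivity = tp / (tp + fn);               // 0.946
            double specificity = tn / (tn + fp);               // 0.971
            double accuracy = (tp + tn) / (tp + fp + tn + fn); // 0.9619 -> 96.19%
            System.out.printf("TPR=%.3f TNR=%.3f accuracy=%.4f%n",
                    sensitivity, specificity, accuracy);
        }
    }

These values agree with the recall and accuracy figures reported in the tables above.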
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average     Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.2325    8
Uniformity of Cell Size       539.79308     0.702       0.3          180.2650    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.6733    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445    7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.5427    5
Bare Nuclei                   489.00953     0.603       0.303        163.3052    3
Bland Chromatin               453.20971     0.555       0.201        151.3219    4
Normal Nucleoli               416.63061     0.487       0.237        139.1182    6
Mitoses                       191.9682      0.212       0.212        64.1227     9
04072023AAST-Comp eng96
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
AAST-Comp eng 20
Data Mining Models and Tasks
04072023
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Data mining Tools
Many advanced tools for data mining are available either as open-source or commercial software
21AAST-Comp eng04072023
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.
Classification vs. Prediction
Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
Prediction:
• models continuous-valued functions, i.e. predicts unknown or missing values
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees or mathematical formulae.
Model usage: classifying future or unknown objects
• Estimate the accuracy of the model:
  – The known label of each test sample is compared with the classified result from the model.
  – The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  – The test set is independent of the training set, otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
Classification Process (1): Model Construction
Training data:
  NAME   RANK            YEARS  TENURED
  Mike   Assistant Prof  3      no
  Mary   Assistant Prof  7      yes
  Bill   Professor       2      yes
  Jim    Associate Prof  7      yes
  Dave   Assistant Prof  6      no
  Anne   Associate Prof  3      no
The classification algorithm produces the classifier (model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
Testing data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4) → tenured = 'yes' by the rule above.
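The two steps can be replayed in a few lines of Python on the toy faculty data above; here the rule is written out by hand exactly as the slide gives it, rather than induced by a learning algorithm.

```python
# Sketch of the two-step process on the toy faculty data:
# (1) a model describes the training set, (2) the model classifies unseen tuples.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 1 check: the rule reproduces every label in the training set
assert all(model(rank, years) == label for _, rank, years, label in training)

# Step 2: use the model on unseen data
print("Jeff, Professor, 4 ->", model("Professor", 4))   # yes
```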
Classification
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data
Quality of a classifier
• Quality will be calculated with respect to the lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows, for new entries, the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.
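A small sketch of building such a confusion matrix with scikit-learn (the labels are invented for illustration); note that scikit-learn's convention puts the actual classes on the rows and the predicted classes on the columns, the transpose of the wording above.

```python
# Confusion-matrix sketch: the diagonal holds the correctly classified instances.
from sklearn.metrics import confusion_matrix

actual    = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign", "benign"]

cm = confusion_matrix(actual, predicted, labels=["benign", "malignant"])
print(cm)
# [[2 1]   2 benign correctly classified, 1 benign misclassified
#  [1 2]]  1 malignant misclassified,     2 malignant correctly classified
```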
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.
Classification Techniques
[Diagram – classification techniques covered: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK]
Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
Tennis example
[Figure – points plotted by humidity and temperature, marked as "play tennis" vs. "do not play tennis"]
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data, (ii) place the hyper-plane 'far' from the data.
SVM – Support Vector Machines
[Figure – two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training points closest to the boundary]
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure – the maximum-margin hyperplane and its support vectors, compared with a narrower margin]
Non-Separable Case
[Figure – a data set that is not linearly separable]
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
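A minimal linear-SVM sketch on a toy two-dimensional set, using scikit-learn's SVC rather than the Weka SMO implementation evaluated later in the paper; the data points are invented for illustration.

```python
# Linear SVM sketch: fit a maximum-margin separating line on toy 2-D data
# and read off the support vectors and the hyperplane parameters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # larger C -> less tolerance for margin violations
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("hyperplane: w =", clf.coef_[0], "b =", clf.intercept_[0])
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]])[0])
```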
Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the (K) examples that are most similar to it.
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
[Figure – a new point surrounded by "response" and "no response" training points; its class is "response"]
Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as
  D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
• John: Age = 35, Income = 95K, No. of credit cards = 3
• Rachel: Age = 41, Income = 215K, No. of credit cards = 2
• Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Figure – a new point classified as "respond" from the surrounding "response" / "no response" training points]
Example: 3-Nearest Neighbors
Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35         | 3         | No       | sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah   | 63  | 200        | 1         | No       | sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom      | 59  | 170        | 1         | No       | sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
The three nearest neighbors (Rachel, John, Nellie) respond Yes, No, Yes, so David's predicted response is Yes.
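The example can be reproduced in a few lines of Python; the sketch below computes the Euclidean distances from David to every customer, keeps the three nearest, and takes a majority vote.

```python
# 3-nearest-neighbour sketch reproducing the David example above.
import math

customers = [                      # (name, age, income in K, no. of cards, response)
    ("John",   35,  35, 3, "No"),
    ("Rachel", 22,  50, 2, "Yes"),
    ("Hannah", 63, 200, 1, "No"),
    ("Tom",    59, 170, 1, "No"),
    ("Nellie", 25,  40, 4, "Yes"),
]
david = (37, 50, 2)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ranked = sorted(customers, key=lambda c: distance(c[1:4], david))
nearest3 = ranked[:3]
votes = [c[4] for c in nearest3]
print("3 nearest:", [c[0] for c in nearest3])                    # Rachel, John, Nellie
print("predicted response:", max(set(votes), key=votes.count))   # Yes
```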
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1
Jenny Lind – Payoff Table
Decisions               | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
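The same expected-return computation, written out as a short Python sketch over the payoff table above.

```python
# Expected-return sketch for the Jenny Lind decision: weight each payoff by
# its prior probability and pick the alternative with the larger expected value.
probabilities = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "Sign with Movie Company": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "Sign with TV Network":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

expected = {
    decision: sum(probabilities[s] * payoff for s, payoff in outcomes.items())
    for decision, outcomes in payoffs.items()
}
print(expected)                                   # movie: 960000.0, TV: 900000.0
print("best decision:", max(expected, key=expected.get))
```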
Decision Trees
• Three types of "nodes":
  – Decision nodes – represented by squares (□)
  – Chance nodes – represented by circles (○)
  – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
Example Decision Tree
[Figure – a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2 and Event 3]
Jenny Lind Decision Tree
[Figure – decision node with two branches: "Sign with Movie Co." leads to a chance node with outcomes Small / Medium / Large Box Office paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node paying $900,000 on every outcome]

Jenny Lind Decision Tree (with probabilities)
[Figure – the same tree with probabilities 0.3 / 0.6 / 0.1 attached to the outcomes of each chance node and the expected returns (ER) still to be computed]

Jenny Lind Decision Tree – Solved
[Figure – the solved tree: ER of the movie branch = $960,000, ER of the TV branch = $900,000, so the value of the decision node is $960,000 and the movie contract is chosen]
[Diagram – the data mining cycle, repeated here to introduce the performance evaluation stage]
Evaluation Metrics
                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn
Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – split the data into 10 equal-sized pieces
  – train on 9 pieces and test on the remainder
  – do this for all possibilities and average
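A sketch of the same 10-fold procedure with scikit-learn standing in for Weka's built-in cross-validation; the classifier and dataset here are placeholders, so the accuracy figures will not match the 143/150 numbers above.

```python
# 10-fold cross-validation sketch: split the data into 10 equal pieces, train
# on 9 and test on the remaining piece, repeat for every piece and average.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)

print("per-fold accuracy:", scores.round(3))
print("mean accuracy over 10 folds:", scores.mean().round(3))
```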
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – are usually not harmful
  – rarely invade the tissues around them
  – don't spread to other parts of the body
  – can be removed and usually don't grow back
• Malignant tumors:
  – may be a threat to life
  – can invade nearby organs and tissues (such as the chest wall)
  – can spread to other parts of the body
  – often can be removed, but sometimes grow back
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachia et al. used naive Bayes, a decision tree and a back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiocography1 and cardiocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin contains 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence the percentages above are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
Attribute                   | Domain
Sample Code Number          | id number
Clump Thickness             | 1 – 10
Uniformity of Cell Size     | 1 – 10
Uniformity of Cell Shape    | 1 – 10
Marginal Adhesion           | 1 – 10
Single Epithelial Cell Size | 1 – 10
Bare Nuclei                 | 1 – 10
Bland Chromatin             | 1 – 10
Normal Nucleoli             | 1 – 10
Mitoses                     | 1 – 10
Class                       | 2 for benign, 4 for malignant
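A sketch of how the 683-instance dataset described above could be prepared from the UCI file; the local file name is an assumption, and pandas stands in for Weka's ARFF loader.

```python
# Preparing the breast-cancer-wisconsin file from the UCI repository:
# 699 rows, 11 columns (id, the 9 attributes above, class), missing values
# written as "?". Dropping the 16 incomplete rows leaves the 683 instances used here.
import pandas as pd

columns = ["id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
           "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
           "normal_nucleoli", "mitoses", "class"]

df = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")
df = df.dropna()                      # removes the 16 instances with missing values
print(len(df))                        # 683
print(df["class"].value_counts())     # 2 (benign): 444, 4 (malignant): 239
```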
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
EXPERIMENTAL RESULTS

Importance of the input variables
Frequency of each attribute value (domain 1 – 10) over the 683 instances:
Attribute                   |    1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             |  139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     |  373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    |  346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           |  393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |   44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 |  402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             |  150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             |  432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     |  563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |
EXPERIMENTAL RESULTS
Evaluation criteria              | BF Tree | IBK   | SMO
Time to build model (seconds)    | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
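A quick check of these definitions on the SMO confusion matrix reported below, taking malignant as the positive class.

```python
# Sensitivity / specificity / accuracy computed from the SMO confusion matrix,
# with "malignant" treated as the positive class.
tp, fn = 226, 13      # malignant correctly / wrongly predicted
tn, fp = 431, 13      # benign correctly / wrongly predicted

sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
accuracy    = (tp + tn) / (tp + fp + tn + fn)

print(f"sensitivity = {sensitivity:.4f}")     # 0.9456
print(f"specificity = {specificity:.4f}")     # 0.9707
print(f"accuracy    = {accuracy:.4f}")        # 0.9619 (the 96.19% reported for SMO)
```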
EXPERIMENTAL RESULTS
Classifier | TP rate | FP rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
           | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
           | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
           | 0.946   | 0.029   | 0.946     | 0.946  | Malignant
EXPERIMENTAL RESULTS (confusion matrices)
Classifier | Predicted Benign | Predicted Malignant | Actual class
BF Tree    | 431              | 13                  | Benign
           | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
           | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
           | 13               | 226                 | Malignant
Importance of the input variables
Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.210      | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9
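A sketch of producing a comparable ranking with scikit-learn's chi-squared and mutual-information (information gain) scorers; Weka's attribute evaluators were used for the table above, so the exact scores will differ, and the cleaned input file name below is hypothetical.

```python
# Rank the 9 attributes by chi-squared score and information gain (mutual information).
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

columns = ["clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
           "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
           "normal_nucleoli", "mitoses", "class"]
# Assumes the cleaned 683-instance file prepared earlier, with the id column dropped.
df = pd.read_csv("breast-cancer-wisconsin-clean.csv", names=columns)
X, y = df.drop(columns="class"), df["class"]

chi_scores, _ = chi2(X, y)
info_gain = mutual_info_classif(X, y, discrete_features=True, random_state=0)

ranking = pd.DataFrame({"chi2": chi_scores, "info_gain": info_gain}, index=X.columns)
print(ranking.sort_values("chi2", ascending=False))
```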
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is the highest compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.
Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.
Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T / SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for
data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code
bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes
bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature
04072023 AAST-Comp eng 22
Results
Data preprocessing
Feature selection Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
[Methodology cycle: Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Performance evaluation → Results]
Data Preprocessing
• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; measures include accuracy, completeness, consistency, timeliness, believability, value added and accessibility.
AAST-Comp eng 2404072023
Preprocessing techniques
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, with particular importance for numerical data.
AAST-Comp eng 2504072023
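As a rough illustration of these preprocessing steps on the dataset used later in the paper, a minimal pandas sketch might look like the following; the file name, column names and the choice to simply drop incomplete rows are illustrative assumptions, not part of the slides.

```python
# Minimal preprocessing sketch (assumed file name and column names).
import pandas as pd

columns = ["id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
           "epithelial_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli",
           "mitoses", "class"]

# Missing values in the UCI file are coded as "?"; treat them as NaN while reading.
data = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")

# Data cleaning: drop the rows with missing attribute values.
data = data.dropna()

# Data transformation: scale every attribute from its 1-10 domain into [0, 1].
features = data.drop(columns=["id", "class"])
features = (features - features.min()) / (features.max() - features.min())
```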
Finding a feature subset that carries the most discriminative information from the original feature space.
The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.
Feature selection
AAST-Comp eng 2704072023
Feature Selection
• Transforming a dataset by removing some of its columns, e.g. keeping only A2, A4 and the class C out of the original attributes A1, A2, A3, A4, C.
04072023 AAST-Comp eng 28
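A minimal sketch of such a column-removing (filter) feature selection, assuming scikit-learn and using its bundled Wisconsin diagnostic dataset as a stand-in:

```python
# Filter-based feature selection sketch: keep the k columns with the highest chi-squared score.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True)    # stand-in dataset with non-negative features
selector = SelectKBest(score_func=chi2, k=5)  # keep the 5 most informative columns
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)         # e.g. (569, 30) -> (569, 5)
```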
Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built from the training set of known categories.
AAST-Comp eng
Category "A"
Category "B"
Classification (Recognition) (Supervised Classification)
3004072023
Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not.
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction
• models continuous-valued functions, i.e. predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
Classification: A Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no
ClassificationAlgorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data
(Jeff Professor 4)
Tenured
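A small sketch of the two-step process on this toy example; the classifier is written out directly as the rule learned in step 1, and the helper function and accuracy print-out are illustrative additions, not from the slides.

```python
# Sketch of the two-step classification process on the toy 'tenured' table.

def tenured(rank: str, years: int) -> str:
    """Classifier (model) from step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on an independent test set ...
test_set = [("Tom", "Assistant Prof", 2, "no"), ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"), ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(tenured(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy on the test set: {correct}/{len(test_set)}")  # 3/4: Merlisa is misclassified

# Step 2b: ... then classify genuinely unseen data, e.g. (Jeff, Professor, 4).
print(tenured("Professor", 4))  # -> 'yes'
```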
Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality is also judged by computing time: the lower, the better.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the cross-diagonal elements represent misclassified compounds.
AAST-Comp eng 3804072023
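A short sketch of building such a confusion matrix with scikit-learn; the library and the toy labels are assumptions, and note that scikit-learn puts actual classes on the rows, the transpose of the convention described above.

```python
# Confusion-matrix sketch for judging classifier quality.
from sklearn.metrics import confusion_matrix

y_actual    = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
y_predicted = ["benign", "malignant", "malignant", "malignant", "benign", "benign"]

# Rows = actual class, columns = predicted class; the diagonal holds the correct predictions.
cm = confusion_matrix(y_actual, y_predicted, labels=["benign", "malignant"])
print(cm)
# [[2 1]
#  [1 2]]
```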
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Classification techniques considered: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
Tennis example
[Scatter plot: humidity vs. temperature; points marked as "play tennis" or "do not play tennis"]
04072023 AAST-Comp eng 44
Linear classifiers: which hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  - one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
[Figure: the line ax + by - c = 0 represents the decision boundary]
45
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.
04072023 AAST-Comp eng 46
SVM - Support Vector Machines
[Figure: support vectors shown for a small-margin and a large-margin separating hyperplane]
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: support vectors define the maximized margin; a narrower margin is also shown]
48
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
04072023 AAST-Comp eng 51
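A minimal sketch of a linear maximum-margin classifier, using scikit-learn's SVC as the SVM implementation; the toy points are invented for illustration.

```python
# Maximum-margin classifier sketch on two linearly separable point clouds.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # class 0
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors determine the separating hyper-plane w·x + b = 0.
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]])[0])
```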
Classification Model: K-Nearest Neighbor classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• calculate the distance between E and all examples in the training set;
• select the K examples nearest to E in the training set;
• assign E to the most common class among its K nearest neighbors.
[Diagram: stored neighbors labeled "Response" / "No response"; the new example is assigned the majority class "Response"]
04072023 AAST-Comp eng 54
Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples.
• The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:
  D(X, Y) = sqrt( sum over i = 1..n of (xi - yi)^2 )
• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):
  Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
04072023 AAST-Comp eng 55
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Diagram: stored examples labeled "Response" / "No response"; the new instance is assigned the class "Respond"]
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Distance from David (37, 50K, 2):
John:    sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2]  = 15.16
Rachel:  sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2]  = 15
Hannah:  sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom:     sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.0
Nellie:  sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2]  = 15.74
The three nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David is predicted to respond: Yes.
04072023 AAST-Comp eng 58
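The same 3-nearest-neighbour prediction for David can be reproduced with a few lines of plain Python; this is only a sketch, but the distances and the vote follow the numbers on the slide.

```python
# 3-nearest-neighbour sketch reproducing the David example.
import math

# (name, age, income in K, number of credit cards, response)
training = [("John", 35, 35, 3, "No"), ("Rachel", 22, 50, 2, "Yes"),
            ("Hannah", 63, 200, 1, "No"), ("Tom", 59, 170, 1, "No"),
            ("Nellie", 25, 40, 4, "Yes")]
david = (37, 50, 2)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Instance-based learning: no model is built, just distances to the stored examples.
by_distance = sorted(training, key=lambda r: euclidean(r[1:4], david))
k_nearest = by_distance[:3]                 # Rachel (15.0), John (15.16), Nellie (15.74)
votes = [label for *_, label in k_nearest]
print(max(set(votes), key=votes.count))     # majority class -> 'Yes'
```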
Strengths and Weaknesses
Strengths:
• simple to implement and use;
• comprehensible: easy to explain the prediction;
• robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• needs a lot of space to store all examples;
• takes more time to classify a new example than with a model (the distance from the new example to all stored examples must be calculated and compared).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. At the end of the learning process a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  - small box office: $200,000
  - medium box office: $1,000,000
  - large box office: $3,000,000
• TV network payout:
  - flat rate: $900,000
• Probabilities:
  - P(small box office) = 0.3
  - P(medium box office) = 0.6
  - P(large box office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table

Decision                   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior probabilities        0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV of the best alternative
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
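The expected-return arithmetic above is easy to check in code; a short sketch using the payoff-table values:

```python
# Expected-return sketch for the Jenny Lind example (values from the payoff table).
probabilities = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payoffs = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv_payoff = 900_000

ev_movie = sum(probabilities[s] * movie_payoffs[s] for s in probabilities)
ev_tv = sum(probabilities[s] * tv_payoff for s in probabilities)

print(f"EV(movie) = ${ev_movie:,.0f}")   # $960,000
print(f"EV(tv)    = ${ev_tv:,.0f}")      # $900,000
print("sign with movie company" if ev_movie > ev_tv else "sign with TV network")
```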
Decision Trees
• Three types of "nodes":
  - decision nodes, represented by squares
  - chance nodes, represented by circles
  - terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
[Figure: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2 and Event 3]
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
[Figure: "Sign with Movie Co." leads to a chance node with branches Small / Medium / Large Box Office paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node whose three branches each pay $900,000]
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (with probabilities)
[Figure: the same tree with branch probabilities 0.3 / 0.6 / 0.1 on Small / Medium / Large Box Office for both alternatives; the expected return (ER) of each chance node is still to be computed]
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
[Figure: with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV network chance node has ER = $900,000, so the movie branch is kept at the decision node (ER = $960,000)]
04072023 AAST-Comp eng 70
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - split the data into 10 equal-sized pieces;
  - train on 9 pieces and test on the remainder;
  - do this for all possibilities and average.
04072023 AAST-Comp eng 73
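A sketch of the same 10-fold cross-validation idea with scikit-learn, used here only as a stand-in for Weka; the dataset is scikit-learn's bundled Wisconsin diagnostic data, not the exact file from the paper.

```python
# 10-fold cross-validation sketch.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                    # stand-in dataset
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)   # 10 equal-sized folds

print(scores)                       # accuracy of each train-on-9 / test-on-1 split
print("mean accuracy:", scores.mean())
```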
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the results are reported.
• Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.
Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: women with denser breast tissue have a higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. Classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms were accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository.
• Data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution of the original data: benign 458 (65.5%), malignant 241 (34.5%).
• Note: after excluding the 16 instances with missing values (14 benign, 2 malignant), the distribution becomes benign 444 (65%) and malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute / Domain
Sample Code Number: id number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis), version 3.6.9. WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
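For readers who prefer Python to the Weka GUI, the overall shape of the experiment can be sketched as below; SVC, KNeighborsClassifier and DecisionTreeClassifier are only stand-ins for Weka's SMO, IBK and BF Tree, and the data file name is an assumption.

```python
# Experiment sketch: three classifiers evaluated with 10-fold cross-validation.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

columns = ["id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
           "epithelial_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli",
           "mitoses", "class"]
data = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?").dropna()
X, y = data.drop(columns=["id", "class"]), data["class"]

classifiers = {"SMO-like SVM": SVC(kernel="linear"),
               "IBK-like KNN": KNeighborsClassifier(n_neighbors=1),
               "tree (BF Tree stand-in)": DecisionTreeClassifier(random_state=0)}

for name, clf in classifiers.items():
    accuracy = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: {accuracy:.4f}")
```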
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation criteria                   BF Tree   IBK     SMO
Time to build model (in sec)          0.97      0.02    0.33
Correctly classified instances        652       655     657
Incorrectly classified instances      31        28      26
Accuracy (%)                          95.46     95.90   96.19
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
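Using the SMO confusion matrix above, the sensitivity, specificity and accuracy defined earlier can be recomputed directly (benign taken as the positive class):

```python
# Metrics from the SMO confusion matrix reported above.
tp, fn = 431, 13     # actual benign:    431 predicted benign, 13 predicted malignant
fp, tn = 13, 226     # actual malignant: 13 predicted benign, 226 predicted malignant

sensitivity = tp / (tp + fn)                     # true positive rate  -> 0.971
specificity = tn / (tn + fp)                     # true negative rate  -> 0.946
accuracy    = (tp + tn) / (tp + tn + fp + fn)    # -> 0.9619, i.e. 96.19%

print(sensitivity, specificity, accuracy)
```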
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.05950     0.464       0.210        130.244500   7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.96820     0.212       0.212        64.122733    9
04072023AAST-Comp eng96
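A sketch of how such an attribute ranking could be reproduced outside Weka, using chi-squared and mutual information (as an information-gain analogue) from scikit-learn; the file name and the averaging of per-score ranks are assumptions.

```python
# Attribute-ranking sketch in the spirit of the table above.
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

columns = ["id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
           "epithelial_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli",
           "mitoses", "class"]
data = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?").dropna()
X = data.drop(columns=["id", "class"]).astype(int)
y = data["class"]

chi2_scores, _ = chi2(X, y)
info_gain = mutual_info_classif(X, y, discrete_features=True, random_state=0)

ranking = pd.DataFrame({"chi2": chi2_scores, "info_gain": info_gain}, index=X.columns)
ranking["avg_rank"] = ranking.rank(ascending=False).mean(axis=1)
print(ranking.sort_values("avg_rank"))   # Uniformity of Cell Size should rank near the top
```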
0407202397
CONCLUSION
• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows a higher level of performance than the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions
04072023AAST-Comp eng99
Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers," International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report," Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] "World Cancer Report," International Agency for Research on Cancer, Lyon, IARC Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem," Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Data Preprocessing
bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes
of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names
bull Quality decisions must be based on quality data measures
Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility
AAST-Comp eng 2404072023
Preprocessing techniques
bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and
resolve inconsistencies
bull Data integrationndash Integration of multiple databases data cubes or files
bull Data transformationndash Normalization and aggregation
bull Data reductionndash Obtains reduced representation in volume but produces the same or
similar analytical results
bull Data discretizationndash Part of data reduction but with particular importance especially for
numerical data
AAST-Comp eng 2504072023
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003; 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1–3), 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive & descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification: A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM – Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
Preprocessing techniques
• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes, or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization – part of data reduction, of particular importance for numerical data.
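For illustration (not part of the original slides), a minimal pandas sketch of the cleaning step that is applied to the Wisconsin data set later in the paper: missing values are coded as "?" in the UCI file and the 16 affected rows are dropped, leaving 683 of the 699 instances. The column names are assumptions of ours, since the raw file has no header row.

```python
import pandas as pd

# Assumed column names for breast-cancer-wisconsin.data (the UCI file has no header).
cols = ["id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
        "epithelial_size", "bare_nuclei", "chromatin", "nucleoli",
        "mitoses", "class"]          # class: 2 = benign, 4 = malignant

df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
df = df.dropna()                     # data cleaning: drop the 16 rows with missing values
df["bare_nuclei"] = df["bare_nuclei"].astype(int)
print(len(df))                       # expected: 683
```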
The data mining cycle (figure): Dataset → selection of the data mining tool → data preprocessing → feature selection → classification → performance evaluation → results.

Feature selection
• Finding a feature subset that carries the most discriminative information from the original feature space.
• The objectives of feature selection are:
  – improving the prediction performance of the predictors;
  – providing faster and more cost-effective predictors;
  – providing a better understanding of the underlying process that generated the data.

Feature Selection
• Transforming a dataset by removing some of its columns.
• Example (figure): a dataset with attributes A1, A2, A3, A4 and class C is reduced to A2, A4 and C.
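A hedged sketch of this step in code, using scikit-learn rather than the Weka attribute evaluators used in the paper: the nine attributes are ranked by a univariate chi-squared score and only the highest-scoring columns are kept. `df` is assumed to be the cleaned dataframe from the preprocessing sketch above; the choice of k = 4 is arbitrary and only for illustration.

```python
from sklearn.feature_selection import SelectKBest, chi2

X = df.drop(columns=["id", "class"])           # the 9 attribute columns
y = df["class"]

selector = SelectKBest(score_func=chi2, k=4)   # keep the 4 highest-scoring columns
X_reduced = selector.fit_transform(X, y)

# Print the attributes from best to worst chi-squared score.
for name, score in sorted(zip(X.columns, selector.scores_), key=lambda t: -t[1]):
    print(f"{name:20s} {score:10.2f}")
```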
Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set / known categories.
(Figure: samples labelled Category "A" and Category "B" – classification (recognition), i.e. supervised classification.)
Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.

Classification vs. Prediction
• Classification: predicts categorical class labels (discrete or nominal); classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
• Prediction: models continuous-valued functions, i.e. predicts unknown or missing values.
Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
  – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples correctly classified by the model; the test set is independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction
Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm learns the classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
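A toy illustration of the two-step process: the "model" is the rule learned from the training set above, and it is then applied to the unseen tuple (Jeff, Professor, 4).

```python
def tenured(rank: str, years: int) -> str:
    # Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

print(tenured("Professor", 4))   # Jeff -> "yes"
```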
Classification
• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification
• Predicts categorical class labels (discrete or nominal).
• Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.
Quality of a classifier
• Quality is also judged with respect to computing time (lower is better).
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows, for new entries, the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified samples, and the off-diagonal elements represent misclassified samples.

Classification Techniques
• Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
• The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.
Classification Techniques (figure): Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)
Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine that has been used extensively as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.
Tennis example (figure): samples plotted by Humidity vs. Temperature, marked as "play tennis" or "do not play tennis".
Linear classifiers: which hyperplane?
• There are lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution:
  – it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane
• Objective: select a "good" hyper-plane using only the data.
• Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane far from the data.
SVM – Support Vector Machines (figure): two separating hyperplanes with their support vectors, one with a small margin and one with a large margin.

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
(Figure: support vectors on the maximized margin vs. a narrower margin.)
Non-Separable Case (figure)
The Lagrangian trick (figure)
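For reference, a standard statement of the soft-margin problem that the non-separable case and the "Lagrangian trick" refer to (standard SVM material, not taken from the slides): slack variables \(\xi_i\) allow some points to violate the margin, and the constrained problem is solved through its Lagrangian dual.

\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
\quad\text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
\]
\[
\max_{\alpha}\;\; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, (x_i\cdot x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0 .
\]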
SVM summary
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.
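A hedged sketch of training such a classifier in code, using scikit-learn's SVC (a libsvm, SMO-type solver) as a stand-in for Weka's SMO; `X` and `y` are assumed to be the cleaned Wisconsin attributes and class labels from the earlier sketches.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = SVC(kernel="linear", C=1.0)    # maximum-margin linear classifier
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```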
Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• calculate the distance between E and all examples in the training set;
• select the K nearest examples to E in the training set;
• assign E to the most common class among its K nearest neighbors.
(Figure: a new point among "response" / "no response" neighbors; class = Response.)
Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

  D(X, Y) = sqrt( sum over i = 1..n of (xi − yi)^2 )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):
  Distance(John, Rachel) = sqrt[ (35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2 ]
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
(Figure: a new point among "response" / "no response" neighbors; class = Respond.)
Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?
Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35         | 3         | No       | sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.16
Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
Hannah   | 63  | 200        | 1         | No       | sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] = 152.23
Tom      | 59  | 170        | 1         | No       | sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] = 122
Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] = 15.74
David    | 37  | 50         | 2         | Yes (predicted from the 3 nearest neighbors: Rachel, John, Nellie)
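The 3-NN example above, reproduced in code: Euclidean distance from David to every stored customer, then a majority vote over the three closest.

```python
from math import sqrt
from collections import Counter

# (name, age, income in K, number of credit cards, response)
train = [("John", 35, 35, 3, "No"), ("Rachel", 22, 50, 2, "Yes"),
         ("Hannah", 63, 200, 1, "No"), ("Tom", 59, 170, 1, "No"),
         ("Nellie", 25, 40, 4, "Yes")]
david = (37, 50, 2)

def distance(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

nearest = sorted(train, key=lambda r: distance(r[1:4], david))[:3]
prediction = Counter(r[4] for r in nearest).most_common(1)[0][0]
print([r[0] for r in nearest])   # ['Rachel', 'John', 'Nellie']
print(prediction)                # 'Yes'
```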
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network she will receive a single lump sum, but if she signs with the movie company the amount she will receive depends on the market response to her movie. What should she do?
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), i.e. EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
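The same expected-value computation, spelled out in code with the values from the payoff table.

```python
payoff_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv    = {"small": 900_000, "medium":   900_000, "large":   900_000}
prob         = {"small": 0.3,     "medium": 0.6,        "large": 0.1}

ev_movie = sum(prob[s] * payoff_movie[s] for s in prob)   # 960000.0
ev_tv    = sum(prob[s] * payoff_tv[s] for s in prob)      # 900000.0
print("sign with movie company" if ev_movie > ev_tv else "sign with TV network")
```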
Decision Trees
• Three types of "nodes":
  – decision nodes, represented by squares;
  – chance nodes, represented by circles (Ο);
  – terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.
Example Decision Tree (figure): a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2 and Event 3.
Jenny Lind Decision Tree (figure): a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with outcomes Small, Medium and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 / $900,000 / $900,000 for the TV contract.

Jenny Lind Decision Tree (figure, with probabilities): the same tree with probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large Box Office branches, and expected-return (ER) values to be computed at each chance node.

Jenny Lind Decision Tree – Solved (figure): ER(movie) = $960,000, ER(TV) = $900,000; the best decision (ER = $960,000) is to sign with the movie company.
Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – split the data into 10 equal-sized pieces;
  – train on 9 pieces and test on the remainder;
  – do this for all possibilities and average.
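A hedged sketch of the same 10-fold cross-validation in code, again with scikit-learn in place of Weka; `X` and `y` are the cleaned Wisconsin attributes and class labels assumed in the earlier sketches.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy over 10 folds:", scores.mean())
```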
A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – are usually not harmful;
  – rarely invade the tissues around them;
  – don't spread to other parts of the body;
  – can be removed and usually don't grow back.
• Malignant tumors:
  – may be a threat to life;
  – can invade nearby organs and tissues (such as the chest wall);
  – can spread to other parts of the body;
  – often can be removed, but sometimes grow back.
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (2)
• Breast radiation early in life
• Treatment with DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree (RepTree), RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
BACKGROUND (2)
• Bellaachia et al. used naive Bayes, a decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
BACKGROUND (3)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
BACKGROUND (4)
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
BACKGROUND (5)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for Benign, 4 for Malignant
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
EXPERIMENTAL RESULTS
importance of the input variables
(Distribution of attribute values over the 683 instances: count of each value 1–10.)

Attribute                   | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139  | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373  | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346  | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393  | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44   | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402  | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150  | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432  | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563  | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |
EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
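Applying these definitions to the SMO confusion matrix reported in the results (taking benign as the positive class: TP = 431, FN = 13, FP = 13, TN = 226) gives, as a check of our own:

\[
\text{sensitivity} = \frac{431}{431+13} \approx 0.971,\qquad
\text{specificity} = \frac{226}{226+13} \approx 0.946,\qquad
\text{accuracy} = \frac{431+226}{683} \approx 0.962,
\]

consistent with the per-class recall values in the table below and with the 96.19% accuracy reported above.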
EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
IBK        | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Results
Data preprocessing
Feature selection
Classification
Selection tool datamining
Performance evaluation Cycle
Dataset
Finding a feature subset that has the most discriminative information from the original feature space
The objective of feature selection is bull Improving the prediction performance of the
predictorsbull Providing a faster and more cost-effective
predictorsbull Providing a better understanding of the underlying
process that generated the data
Feature selection
AAST-Comp eng 2704072023
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification - A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Feature selection
 Finding a feature subset that has the most discriminative information from the original feature space.
 The objectives of feature selection are:
 • Improving the prediction performance of the predictors
 • Providing faster and more cost-effective predictors
 • Providing a better understanding of the underlying process that generated the data
AAST-Comp eng 27 04072023
Feature Selection
• Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C  →  A2 A4 C
04072023 AAST-Comp eng 28
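A minimal sketch of this column-removal step with Weka's Remove filter; the file name and the choice to drop the first (id) column are illustrative assumptions, not details taken from the paper.

```java
// Illustrative sketch: transform a dataset by removing some of its columns.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropColumns {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");

        Remove remove = new Remove();
        remove.setAttributeIndices("1");   // e.g. drop the Sample Code Number (id) column
        remove.setInvertSelection(false);  // false = delete the listed columns
        remove.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, remove);
        System.out.println("Attributes before: " + data.numAttributes()
                + ", after: " + reduced.numAttributes());
    }
}
```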
[Diagram: data mining cycle - dataset → data preprocessing → feature selection → classification (selection of data mining tool) → performance evaluation → results]
Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set (known categories).
Category "A" / Category "B" - Classification (Recognition) (Supervised Classification)
AAST-Comp eng 30 04072023
Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not?
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification
 - predicts categorical class labels (discrete or nominal)
 - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction
 - models continuous-valued functions, i.e., predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
Classification - A Two-Step Process
 Model construction: describing a set of predetermined classes
 - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 - The set of tuples used for model construction is the training set
 - The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 - Estimate the accuracy of the model
 - The known label of a test sample is compared with the classified result from the model
 - Accuracy rate is the percentage of test set samples that are correctly classified by the model
 - Test set is independent of training set, otherwise over-fitting will occur
 - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1): Model Construction
Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Classification Algorithms →
Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
04072023 AAST-Comp eng 35
Classification Process (2): Use the Model in Prediction
Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data: (Jeff, Professor, 4) → Tenured?
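A minimal sketch of the two-step process illustrated above with the Weka Java API; the SMO classifier, the 2/3-1/3 split and the file name are illustrative assumptions rather than the paper's setup.

```java
// Sketch: build a model on a training set (step 1), then estimate accuracy on an
// independent test set (step 2).
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TwoStepClassification {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int trainSize = (int) Math.round(data.numInstances() * 0.67);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        SMO model = new SMO();
        model.buildClassifier(train);       // step 1: learn from the training set

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);    // step 2: estimate accuracy on unseen data
        System.out.printf("Accuracy on test set: %.2f%%%n", eval.pctCorrect());
    }
}
```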
Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality is also assessed with respect to computing time (lower is better).
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows, for new entries, the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified instances,
• and the off-diagonal elements represent misclassified instances.
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Classification techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree and IBK.
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine Classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.
Tennis example
[Figure: examples plotted by Humidity and Temperature; one marker = play tennis, the other = do not play tennis]
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the decision boundary: ax + by - c = 0
(Ch. 15)
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: Select a 'good' hyper-plane using only the data.
Intuition (Vapnik 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data
04072023 AAST-Comp eng 46
SVM - Support Vector Machines
Support Vectors: Small Margin vs. Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
48
[Figure: support vectors on the margin; maximized margin vs. narrower margin]
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
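An illustrative sketch of training Weka's SMO (its SVM implementation) with a kernel, echoing the point above that kernels let the SVM learn very complex functions; the RBF kernel, its gamma value and the file name are assumptions, not settings reported in the paper.

```java
// Sketch: SMO with an explicit kernel choice, then classify one instance.
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmWithKernel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);      // width of the RBF kernel (assumed value)
        svm.setKernel(kernel);
        svm.buildClassifier(data);

        Instance first = data.instance(0);
        double predicted = svm.classifyInstance(first);
        System.out.println("Predicted class of first instance: "
                + data.classAttribute().value((int) predicted));
    }
}
```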
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
 Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
 A new example is assigned to the most common class among the K examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
 To determine the class of a new example E:
 1. Calculate the distance between E and all examples in the training set.
 2. Select the K nearest examples to E in the training set.
 3. Assign E to the most common class among its K nearest neighbors.
[Figure: scatter of "Response" and "No response" examples; the new example is assigned the class Response]
 Each example is represented with a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples.
 The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:
 D(X, Y) = sqrt( sum over i = 1..n of (xi - yi)^2 )
 Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):
 Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning
 No model is built: store all training examples; any processing is delayed until a new instance must be classified.
[Figure: stored training examples ("Response" / "No response"); the new instance is classified as Respond]
Example: 3-Nearest Neighbors
Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?
04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel     22    50           2           Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah     63    200          1           No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom        59    170          1           No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie     25    40           4           Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David      37    50           2           Yes (predicted from the 3 nearest neighbors)
04072023 AAST-Comp eng 58
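A small self-contained sketch (plain Java, illustrative only) of the 3-NN vote for David using the numbers in the table above.

```java
// Sketch: compute Euclidean distances to David and take a majority vote over the
// three nearest neighbours.
import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbors {
    public static void main(String[] args) {
        String[] names = {"John", "Rachel", "Hannah", "Tom", "Nellie"};
        double[][] features = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        String[] response = {"No", "Yes", "No", "No", "Yes"};
        double[] david = {37, 50, 2};

        // sort neighbours by distance to David
        Integer[] order = {0, 1, 2, 3, 4};
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(features[i], david)));

        int yes = 0;
        for (int k = 0; k < 3; k++) {   // look at the 3 nearest neighbours
            int i = order[k];
            System.out.printf("%s  d = %.2f  response = %s%n",
                    names[i], distance(features[i], david), response[i]);
            if (response[i].equals("Yes")) yes++;
        }
        System.out.println("Predicted response for David: " + (yes >= 2 ? "Yes" : "No"));
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

The three nearest neighbours are Rachel, John and Nellie; two of the three responded "Yes", so the predicted class for David is "Yes", as on the slide.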
Strengths and Weaknesses
Strengths:
 Simple to implement and use
 Comprehensible - easy to explain the prediction
 Robust to noisy data by averaging the k-nearest neighbors
Weaknesses:
 Need a lot of space to store all examples
 Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
- Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
- The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
                            States of Nature
Decisions                   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior Probabilities         0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
Decision Trees
• Three types of "nodes":
  - Decision nodes - represented by squares
  - Chance nodes - represented by circles
  - Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
[Figure: a decision node (square) branches into Decision 1 and Decision 2; each decision leads to a chance node (circle) with Event 1, Event 2 and Event 3]
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
[Figure: a decision node with two branches - "Sign with Movie Co." leads to a chance node with Small / Medium / Large Box Office outcomes paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node with the same outcomes, each paying $900,000]
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
[Figure: the same tree annotated with probabilities 0.3 / 0.6 / 0.1 on the box-office branches, with the expected return (ER) still to be computed at each chance node]
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
[Figure: with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000, so the "Sign with Movie Co." branch is selected]
04072023 AAST-Comp eng 70
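A tiny sketch (plain Java, illustrative only) that reproduces the expected-return arithmetic of the solved tree above.

```java
// Expected returns for the Jenny Lind decision tree.
public class JennyLindExpectedValue {
    public static void main(String[] args) {
        double[] p = {0.3, 0.6, 0.1};                          // small, medium, large box office
        double[] movie = {200_000, 1_000_000, 3_000_000};
        double[] tv = {900_000, 900_000, 900_000};

        System.out.printf("EV(movie) = $%,.0f%n", dot(p, movie)); // $960,000
        System.out.printf("EV(tv)    = $%,.0f%n", dot(p, tv));    // $900,000
    }

    static double dot(double[] p, double[] payoff) {
        double ev = 0;
        for (int i = 0; i < p.length; i++) ev += p[i] * payoff[i];
        return ev;
    }
}
```

As in the tree, the movie contract has the higher expected return ($960,000 vs. $900,000).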
[Diagram: data mining cycle - dataset → data preprocessing → feature selection → classification (selection of data mining tool) → performance evaluation → results]
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
04072023 AAST-Comp eng 73
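A hedged sketch of the default 10-fold cross-validation described above, using the Weka Evaluation API; the IBk classifier, k = 3, the random seed and the file name are illustrative assumptions.

```java
// Sketch: 10-fold cross-validation of a classifier with Weka.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenFoldCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new IBk(3), data, 10, new Random(1)); // 10 folds
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}
```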
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
 • Are usually not harmful
 • Rarely invade the tissues around them
 • Don't spread to other parts of the body
 • Can be removed and usually don't grow back
 Malignant tumors:
 • May be a threat to life
 • Can invade nearby organs and tissues (such as the chest wall)
 • Can spread to other parts of the body
 • Often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: denser breast tissue carries a higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachia et al. used Naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and one for patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was experimented on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, hence that percentage is wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
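An illustrative sketch of this preprocessing step with the Weka API: load the 699 instances and drop the records that contain missing values, leaving 683; the ARFF file name is an assumption.

```java
// Sketch: remove instances with missing ('?') values from the dataset.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RemoveMissingValues {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (data.instance(i).hasMissingValue()) {
                data.delete(i);   // drop records with missing entries
            }
        }
        System.out.println("Instances after removing missing values: " + data.numInstances());
    }
}
```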
04072023 AAST-Comp eng 86
Attribute                       Domain
Sample Code Number              Id number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
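A hedged sketch of the experimental setup described above: time the model build and run 10-fold cross-validation for BF Tree, IBK and SMO with the Weka API. BFTree ships with Weka 3.6.x (in newer releases it is a separate package); the ARFF file name and the random seed are assumptions.

```java
// Sketch: compare the three classifiers used in the paper on the same dataset.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {new BFTree(), new IBk(), new SMO()};
        String[] names = {"BF Tree", "IBK", "SMO"};

        for (int i = 0; i < classifiers.length; i++) {
            long start = System.currentTimeMillis();
            classifiers[i].buildClassifier(data);                 // timing to build the model
            double seconds = (System.currentTimeMillis() - start) / 1000.0;

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifiers[i], data, 10, new Random(1));
            System.out.printf("%-8s build: %.2fs  correct: %.0f  accuracy: %.2f%%%n",
                    names[i], seconds, eval.correct(), eval.pctCorrect());
        }
    }
}
```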
Feature Selection
bull Transforming a dataset by removing some of its columns
A1 A2 A3 A4 C A2 A4 C
04072023 AAST-Comp eng 28
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011), Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), An Empirical Comparison of Data Mining Classification Methods, International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets, 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the international conference on engineering applications of neural networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines, Proceedings of the international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates, Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L., Nuclear feature extraction for breast tumor diagnosis, Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905: 861-870.
[14] Chen, Y., Abraham, A., Yang, B. (2006), Feature Selection and Classification using Flexible Neural Tree, Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kauffman Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification - A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Supervised Learningbull Supervision The training data (observations measurements etc) are
accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories
AAST-Comp eng
Category ldquoArdquo
Category ldquoBrdquoClassification (Recognition) (Supervised Classification)
3004072023
Classificationbull Everyday all the time we classify
thingsbull Eg crossing the street
ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: International Agency for Research on Cancer. World Cancer Report. IARC Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
Classification
Every day, all the time, we classify things.
E.g. crossing the street:
- Is there a car coming?
- At what speed?
- How far is it to the other side?
- Classification: safe to walk or not.
04072023 AAST-Comp eng 31
04072023 AAST-Comp eng 32
Classification:
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction:
- models continuous-valued functions, i.e. predicts unknown or missing values
Classification vs. Prediction
04072023 AAST-Comp eng 33
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples correctly classified by the model.
- The test set is independent of the training set, otherwise over-fitting will occur.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
04072023 AAST-Comp eng 34
Classification Process (1): Model Construction
Training Data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no
The classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
04072023 AAST-Comp eng 35
Classification Process (2): Use the Model in Prediction
Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen data: (Jeff, Professor, 4) -> the classifier predicts Tenured = yes
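To make the two-step idea concrete, a toy version of the learned tenure rule might look like the sketch below (illustrative only; the rule and records are the ones from the slide).

```java
// Toy classifier for the tenure example above.
public class TenureModel {

    // Rule learned in step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
    static boolean predictTenured(String rank, int years) {
        return rank.equals("Professor") || years > 6;
    }

    public static void main(String[] args) {
        // Step 2: apply the model to unseen data.
        System.out.println(predictTenured("Professor", 4));      // Jeff -> true ("yes")
        // On the test set, Merlisa (Associate Prof, 7 years, actually not tenured)
        // is predicted "yes"; such misclassifications are what the accuracy estimate counts.
        System.out.println(predictTenured("Associate Prof", 7)); // -> true, a misclassification
    }
}
```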
Classification
- is a data mining (machine learning) technique used to predict group membership for data instances
- Classification analysis is the organization of data into given classes.
- These approaches normally use a training set where all objects are already associated with known class labels.
- The classification algorithm learns from the training set and builds a model.
- The model is then used to classify new objects.
AAST-Comp eng 36 04072023
Classification
- predicts categorical class labels (discrete or nominal)
- constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data
AAST-Comp eng 3704072023
Quality of a classifier
- Quality is calculated with respect to the lowest computing time.
- The quality of a given model can be described by a confusion matrix.
- The confusion matrix shows the predictive ability of the method on new entries.
- Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
- Thus the diagonal elements represent correctly classified compounds,
- and the off-diagonal elements represent misclassified compounds.
AAST-Comp eng 3804072023
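A minimal sketch of how such a matrix is filled from predictions (rows are predicted classes, columns are actual classes, as described above; the labels and arrays below are made up purely for illustration):

```java
// Build a 2x2 confusion matrix: rows = predicted class, columns = actual class.
// Diagonal entries count correctly classified instances.
public class ConfusionMatrixDemo {
    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1, 1, 0, 1, 0}; // 0 = benign, 1 = malignant (made-up labels)
        int[] predicted = {0, 0, 1, 0, 1, 0, 1, 1};

        int[][] matrix = new int[2][2];
        for (int i = 0; i < actual.length; i++) {
            matrix[predicted[i]][actual[i]]++;
        }

        int correct = matrix[0][0] + matrix[1][1]; // diagonal = correctly classified
        System.out.println("correct = " + correct + " of " + actual.length);
    }
}
```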
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Classification techniques include:
- Naïve Bayes
- SVM
- C4.5
- KNN
- BF Tree
- IBK
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine Classifier
(V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
- SVM is a state-of-the-art learning machine that has been used extensively as a tool for data classification, function approximation, etc.
- It owes this to its generalization ability and has found a great deal of success in many applications.
- Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Tennis example
(figure: training examples plotted over Humidity and Temperature, with one marker for "play tennis" and another for "do not play tennis")
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
- Lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one.
- Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
The line ax + by - c = 0 represents the decision boundary.
45
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.
04072023 AAST-Comp eng 46
SVM - Support Vector Machines
(figure: support vectors shown for a small-margin and a large-margin separating hyperplane)
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
- SVMs maximize the margin around the separating hyperplane.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving SVMs is a quadratic programming problem.
- Seen by many as the most successful current text classification method.
(figure: support vectors lying on the margin; the chosen hyperplane maximizes the margin, a narrower margin is worse)
48
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
- Relatively new concept.
- Nice generalization properties.
- Hard to learn: learned in batch mode using quadratic programming techniques.
- Using kernels, it can learn very complex functions.
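As a minimal illustration of the linear decision function behind these ideas, the sketch below classifies a point by the sign of w.x + b and reports its distance from the hyperplane; the weights and bias are arbitrary values chosen for the example, not learned from data.

```java
// Sign of the linear decision function f(x) = w.x + b decides the class;
// |f(x)| / ||w|| is the geometric distance of x from the hyperplane.
public class LinearDecision {

    static double decision(double[] w, double b, double[] x) {
        double sum = b;
        for (int i = 0; i < w.length; i++) sum += w[i] * x[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] w = {0.8, -0.5};   // arbitrary weights, for illustration only
        double b = -0.2;
        double[] x = {1.0, 0.4};

        double f = decision(w, b, x);
        double norm = Math.sqrt(w[0] * w[0] + w[1] * w[1]);
        System.out.println("class    = " + (f >= 0 ? "+1" : "-1"));
        System.out.println("distance = " + Math.abs(f) / norm);
    }
}
```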
04072023 AAST-Comp eng 51
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the K nearest examples to E in the training set.
- Assign E to the most common class among its K nearest neighbors.
(figure: points labelled Response / No response around the new example; predicted class: Response)
Distance Between Neighbors
Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples.
The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:
D(X, Y) = sqrt( sum_{i=1..n} (xi - yi)^2 )
Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):
Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
04072023 AAST-Comp eng 55
Instance Based Learning
- No model is built: store all training examples.
- Any processing is delayed until a new instance must be classified.
(figure: points labelled Response / No response; predicted class: Respond)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors
Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?
04072023 AAST-Comp eng 57
Distances from David:
John:   sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2]  = 15.16
Rachel: sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2]  = 15
Hannah: sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom:    sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie: sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2]  = 15.74
The three nearest neighbors (Rachel, John, Nellie) respond Yes, No, Yes, so David's predicted response is Yes.
04072023 AAST-Comp eng 58
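The same calculation can be transcribed directly into a few lines of Java; the sketch below uses the values from the table (income in thousands) and takes the majority vote of David's three nearest neighbours.

```java
// Recompute the distances from David and vote among his 3 nearest neighbours.
import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbours {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        String[] names = {"John", "Rachel", "Hannah", "Tom", "Nellie"};
        double[][] features = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        boolean[] response = {false, true, false, false, true};
        double[] david = {37, 50, 2};

        Integer[] order = {0, 1, 2, 3, 4};
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(features[i], david)));

        int yes = 0;
        for (int k = 0; k < 3; k++) {
            System.out.printf("%-7s d = %.2f%n", names[order[k]], distance(features[order[k]], david));
            if (response[order[k]]) yes++;
        }
        System.out.println("Predicted response for David: " + (yes >= 2 ? "Yes" : "No"));
    }
}
```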
Strengths and Weaknesses
Strengths:
- Simple to implement and use.
- Comprehensible: easy to explain the prediction.
- Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
- Needs a lot of space to store all examples.
- Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
- Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process a decision tree covering the training set is returned.
- The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
Movie company payouts:
- Small box office: $200,000
- Medium box office: $1,000,000
- Large box office: $3,000,000
TV network payout:
- Flat rate: $900,000
Probabilities:
- P(Small Box Office) = 0.3
- P(Medium Box Office) = 0.6
- P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), i.e. EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
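The expected-return computation on this slide is just a probability-weighted sum; a short sketch, with the values taken straight from the payoff table:

```java
// Expected monetary value of each contract = sum over outcomes of p(outcome) * payout.
public class ExpectedReturn {

    static double expectedValue(double[] probs, double[] payouts) {
        double ev = 0;
        for (int i = 0; i < probs.length; i++) ev += probs[i] * payouts[i];
        return ev;
    }

    public static void main(String[] args) {
        double[] probs = {0.3, 0.6, 0.1};   // small, medium, large box office
        double evMovie = expectedValue(probs, new double[]{200_000, 1_000_000, 3_000_000});
        double evTv    = expectedValue(probs, new double[]{900_000, 900_000, 900_000});
        System.out.println("EV(movie) = " + evMovie); // 960000.0
        System.out.println("EV(tv)    = " + evTv);    // 900000.0
        System.out.println("Sign with " + (evMovie > evTv ? "the movie company" : "the TV network"));
    }
}
```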
Decision Trees
Three types of "nodes":
- Decision nodes, represented by squares
- Chance nodes, represented by circles
- Terminal nodes, represented by triangles (optional)
Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
Create the tree from left to right; solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
(figure: a decision node branching into Decision 1 and Decision 2, each leading to a chance node with Events 1-3)
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
(figure: "Sign with Movie Co." leads to a chance node with branches Small / Medium / Large Box Office paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node paying $900,000 on every branch)
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (with probabilities)
(figure: the same tree with branch probabilities 0.3 / 0.6 / 0.1 attached to Small / Medium / Large Box Office, and an expected return ER to be computed at each chance node)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
(figure: the chance node for the movie contract has ER = $960,000 and the TV contract has ER = $900,000; the best decision, ER = $960,000, is to sign with the movie company)
04072023 AAST-Comp eng 70
Results
The experimental cycle is: dataset -> data preprocessing -> feature selection -> classification (with the selected data mining tool) -> performance evaluation.
Evaluation Metrics
                    | Predicted as healthy | Predicted as unhealthy
Actual healthy      | tp                   | fn
Actual not healthy  | fp                   | tn
AAST-Comp eng 7204072023
Cross-validation
- Correctly classified instances: 143 (95.3%)
- Incorrectly classified instances: 7 (4.67%)
- Default: 10-fold cross-validation, i.e.
  - split the data into 10 equal-sized pieces
  - train on 9 pieces and test on the remainder
  - do this for all possibilities and average
04072023 AAST-Comp eng 73
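A bare-bones sketch of the fold bookkeeping described above (shuffle, split into 10 pieces, train on 9, test on 1, average); trainAndScore is only a placeholder, since any of the classifiers in this paper could be plugged in there.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Skeleton of 10-fold cross-validation: only the index bookkeeping is shown.
public class CrossValidation {

    static double trainAndScore(List<Integer> trainIdx, List<Integer> testIdx) {
        return 0.95; // placeholder accuracy; a real run would fit and test a model here
    }

    public static void main(String[] args) {
        int n = 150, folds = 10;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices);

        double sum = 0;
        for (int f = 0; f < folds; f++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                (i % folds == f ? test : train).add(indices.get(i)); // every 10th item goes to the test fold
            }
            sum += trainAndScore(train, test);
        }
        System.out.println("mean accuracy over " + folds + " folds = " + sum / folds);
    }
}
```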
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
- The aim of this paper is to investigate the performance of different classification techniques.
- The goal is to develop accurate prediction models for breast cancer using data mining techniques.
- Three classification techniques are compared in the Weka software and the comparison results are reported.
- Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
- are usually not harmful
- rarely invade the tissues around them
- don't spread to other parts of the body
- can be removed and usually don't grow back
Malignant tumors:
- may be a threat to life
- can invade nearby organs and tissues (such as the chest wall)
- can spread to other parts of the body
- often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
- Gender
- Age
- Genetic risk factors
- Family history
- Personal history of breast cancer
- Race: white or black
- Dense breast tissue (denser breast tissue carries a higher risk)
- Certain benign (not cancer) breast problems
- Lobular carcinoma in situ
- Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
- Breast radiation early in life
- Treatment with the drug DES (diethylstilbestrol) during pregnancy
- Not having children, or having them later in life
- Certain kinds of birth control
- Using hormone therapy after menopause
- Not breastfeeding
- Alcohol
- Being overweight or obese
78
0407202379
BACKGROUND
- Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
- Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
- Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
- Bellaachi et al. used naive Bayes, a decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
- Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
- Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
- Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
- Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
- Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
- Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
- B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
- 2 classes (malignant and benign) and 9 integer-valued attributes.
- breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
- Class distribution as originally reported: Benign 458 (65.5%), Malignant 241 (34.5%).
- Note: 2 malignant and 14 benign instances were excluded, so that percentage is wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
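The corrected class distribution quoted above is easy to verify; the two-line check below is not part of the original slide, it simply recomputes the percentages for the cleaned 683-instance dataset.

```java
// Verify the corrected class distribution of the cleaned 683-instance dataset.
public class ClassDistribution {
    public static void main(String[] args) {
        int benign = 444, malignant = 239, total = benign + malignant;       // 683
        System.out.printf("benign    %.1f%%%n", 100.0 * benign / total);     // ~65.0%
        System.out.printf("malignant %.1f%%%n", 100.0 * malignant / total);  // ~35.0%
    }
}
```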
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for Benign, 4 for Malignant
0407202387
EVALUATION METHODS
- We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
- WEKA is a collection of machine learning algorithms for data mining tasks.
- The algorithms can either be applied directly to a dataset or called from your own Java code.
- WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
- It is also well suited for developing new machine learning schemes.
- WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
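A sketch of how the three classifiers could be evaluated through the Weka API just described; SMO, IBk and BFTree are the Weka 3.6 implementations of the methods compared in this paper, the ARFF path is hypothetical, and the exact options (e.g. k = 3 for IBk) are assumptions rather than the settings used in the experiments.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of the three classifiers compared in the paper.
// Assumes Weka 3.6.x (BFTree ships with that release) and a local ARFF copy of the dataset.
public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {new BFTree(), new IBk(3), new SMO()};
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-8s accuracy = %.2f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
            System.out.println(eval.toMatrixString());
        }
    }
}
```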
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Number of instances taking each value (1-10), per attribute:
Attribute                   |   1 |   2 |   3 |  4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             | 139 |  50 | 104 | 79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     | 373 |  45 |  52 | 38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    | 346 |  58 |  53 | 43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           | 393 |  58 |  58 | 33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |  44 | 376 |  71 | 48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 | 402 |  30 |  28 | 19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             | 150 | 160 | 161 | 39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             | 432 |  36 |  42 | 18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     | 563 |  35 |  33 | 12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |
EXPERIMENTAL RESULTS
Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023 AAST-Comp eng 32
Classification predicts categorical class labels (discrete or
nominal) classifies data (constructs a model) based on
the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Prediction models continuous-valued functions ie
predicts unknown or missing values
Classification vs Prediction
04072023 AAST-Comp eng 33
ClassificationmdashA Two-Step Process
Model construction describing a set of predetermined classes
Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules decision trees or mathematical formulae
Model usage for classifying future or unknown objects Estimate accuracy of the model
The known label of test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set otherwise over-fitting will occur
If the accuracy is acceptable use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive & descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification: A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023 AAST-Comp eng 33
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects, and estimating the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set, otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classifier (Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data: (Jeff, Professor, 4) -> Tenured?
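The following is an illustrative Python sketch (not part of the original slides) of the two-step process above: a rule is constructed from the training tuples and then used to classify the unseen tuple (Jeff, Professor, 4).

# Hypothetical illustration of the two-step classification process.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# Step 1: model construction - the rule induced from the training set.
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Estimate accuracy by comparing the rule's output with the known labels.
accuracy = sum(tenured(r, y) == label for _, r, y, label in training) / len(training)
print(accuracy)                  # 1.0 on this toy training set

# Step 2: model usage - classify unseen data.
print(tenured("Professor", 4))   # 'yes' for (Jeff, Professor, 4)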
Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality is also judged with respect to computing time (lower is better).
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.
AAST-Comp eng 3804072023
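As a minimal sketch (not from the paper), the confusion matrix described above can be tallied by hand; rows are predicted classes and columns are actual classes, with toy labels assumed here.

# Minimal sketch: building a confusion matrix from toy predictions.
actual    = ["benign", "benign", "malignant", "benign", "malignant", "malignant"]
predicted = ["benign", "malignant", "malignant", "benign", "malignant", "benign"]

labels = ["benign", "malignant"]
matrix = {p: {a: 0 for a in labels} for p in labels}
for a, p in zip(actual, predicted):
    matrix[p][a] += 1

for p in labels:
    print(p, [matrix[p][a] for a in labels])
# Diagonal entries are correctly classified samples,
# off-diagonal entries are misclassifications.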
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
Classification techniques considered (figure): Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine Classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc.; due to its generalization ability it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
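The paper runs SMO inside Weka; purely as an illustrative stand-in (an assumption, not the paper's setup), the margin-maximizing behaviour can be sketched with scikit-learn's linear SVM.

from sklearn.svm import SVC
import numpy as np

# Two small, linearly separable point clouds (toy data, not the paper's dataset).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM chooses the separating hyper-plane that maximizes the margin;
# only the support vectors determine the decision function.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)        # the points that define the margin
print(clf.predict([[3.0, 3.0]]))   # classify a new point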
Tennis example (figure): examples plotted by Humidity and Temperature; one marker for "play tennis", another for "do not play tennis".
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
This line represents the decision boundary: ax + by - c = 0
[Ch. 15]
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data. Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane 'far' from the data.
04072023 AAST-Comp eng 46
SVM - Support Vector Machines
(figure): support vectors shown for a small-margin and a large-margin separating hyperplane
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
(figure): the support vectors define the maximum margin; a narrower margin is also shown [Sec. 15.1]
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
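For the non-separable case and kernels mentioned above, here is a small illustrative sketch (an assumption, not the paper's experiment): a soft-margin SVM with an RBF kernel on toy data that no straight line can separate.

from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Toy data that is not linearly separable (one class encircles the other).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# A soft-margin SVM with an RBF kernel: C trades margin width against
# training errors; the kernel lets the model learn a non-linear boundary.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))   # training accuracy, typically close to 1.0 here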
04072023 AAST-Comp eng 51
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm: to determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
(figure): labelled examples of classes "Response" and "No response"; the new example is assigned class "Response" by its neighbors.
Class: Response   04072023 AAST-Comp eng 54
Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples.
• The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:
  D(X, Y) = sqrt( (x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2 )
John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2
Distance(John, Rachel) = sqrt[ (35-41)^2 + (95K-215K)^2 + (3-2)^2 ]
04072023 AAST-Comp eng 55
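A minimal sketch of the Euclidean distance formula above, using the John/Rachel values from the slide (illustrative code, not part of the original deck):

import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john   = (35, 95, 3)    # age, income in K, number of credit cards
rachel = (41, 215, 2)
print(euclidean(john, rachel))   # sqrt((35-41)^2 + (95-215)^2 + (3-2)^2) ~= 120.15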
Instance Based Learning: no model is built; store all training examples; any processing is delayed until a new instance must be classified.
(figure): labelled examples of classes "Response" and "No response"; the stored examples are used directly to classify the new instance as "Respond".
04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer  Age  Income  No. of credit cards  Response
John      35   35K     3                    No
Rachel    22   50K     2                    Yes
Hannah    63   200K    1                    No
Tom       59   170K    1                    No
Nellie    25   40K     4                    Yes
David     37   50K     2                    ?
04072023 AAST-Comp eng 57
Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.0
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David     37   50          2          Yes (predicted from the 3 nearest neighbors: Rachel, John, Nellie)
04072023 AAST-Comp eng 58
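The 3-NN prediction for David in the table above can be reproduced with a short illustrative Python sketch (not part of the original slides):

import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# (age, income in K, number of credit cards) and the known response.
training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

# Sort neighbours by distance from David and take the 3 nearest.
nearest = sorted(training, key=lambda row: euclidean(row[1], david))[:3]
print([(name, round(euclidean(x, david), 2)) for name, x, _ in nearest])
# [('Rachel', 15.0), ('John', 15.17), ('Nellie', 15.75)]

# Majority vote among the 3 nearest neighbours gives David's predicted class.
print(Counter(label for _, _, label in nearest).most_common(1)[0][0])   # 'Yes'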
Strengths and Weaknesses
Strengths: simple to implement and use; comprehensible (easy to explain a prediction); robust to noisy data by averaging over the k nearest neighbors.
Weaknesses: needs a lot of space to store all examples; takes more time to classify a new example than with a model (the distance from the new example to all stored examples must be calculated and compared).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
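As an illustrative sketch of decision-tree induction (scikit-learn here, not the paper's Weka BF Tree; the toy feature encoding is an assumption), the tree is grown by recursively splitting the training examples:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: [years, is_professor] -> tenured?
X = [[3, 0], [7, 0], [2, 1], [7, 0], [6, 0], [3, 0]]
y = ["no", "yes", "yes", "yes", "no", "no"]

# The training examples are recursively split into smaller subsets,
# growing the tree until the subsets are pure (or no split helps).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["years", "is_professor"]))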
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decision                     Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company      $200,000           $1,000,000          $3,000,000
Sign with TV Network         $900,000           $900,000            $900,000
Prior probabilities          0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
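A minimal sketch of the expected-return computation above (illustrative code, not part of the original slides):

# Payoffs and prior probabilities from the Jenny Lind example.
payoff = {"movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
          "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000}}
prob = {"small": 0.3, "medium": 0.6, "large": 0.1}

# Expected value of each decision = sum of probability * payoff.
ev = {d: sum(prob[s] * payoff[d][s] for s in prob) for d in payoff}
print(ev)                   # {'movie': 960000.0, 'tv': 900000.0}
print(max(ev, key=ev.get))  # 'movie' -> sign with the movie company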
Decision Trees
• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree (figure): a decision node branches into Decision 1 and Decision 2; a chance node branches into Event 1, Event 2 and Event 3.
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree (figure): a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with outcomes Small/Medium/Large Box Office and payoffs $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 on every outcome of the TV branch.
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (figure): the same tree with prior probabilities 0.3 / 0.6 / 0.1 attached to the Small/Medium/Large Box Office branches, and an expected return (ER) to be computed at each chance node.
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved (figure): with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000, so the best decision (ER = $960,000) is to sign with the movie company.
04072023 AAST-Comp eng 70
Results
(figure) Performance evaluation cycle: dataset -> data preprocessing -> feature selection -> classification (with the selected data mining tool) -> performance evaluation.
Evaluation Metrics
                      Predicted as healthy   Predicted as unhealthy
Actual healthy        tp                     fn
Actual not healthy    fp                     tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
04072023 AAST-Comp eng 73
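An illustrative 10-fold cross-validation sketch (scikit-learn's bundled breast-cancer data is used here only as a stand-in for the Wisconsin dataset used in the paper; the paper itself uses Weka):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Split the data into 10 folds, train on 9 and test on the remainder,
# then average the accuracies over all folds.
X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())   # mean accuracy over the 10 folds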
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
78
0407202379
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and one for patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
• Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was experimented on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution (699 instances): Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence that percentage no longer applies; the correct distribution is benign 444 (65%) and malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant
0407202387
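A hypothetical loader for this dataset (illustrative pandas sketch, not the paper's Weka workflow; the URL and column names are taken from the UCI dataset documentation and are assumptions of this sketch):

import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
cols = ["id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
        "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
        "normal_nucleoli", "mitoses", "class"]
df = pd.read_csv(url, names=cols, na_values="?")

df = df.dropna()                     # drop the 16 instances with missing values
print(len(df))                       # 683 instances remain
print(df["class"].value_counts())    # 2 = benign (444), 4 = malignant (239)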
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain                        1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size   44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                   402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin               150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15  60   683
Mitoses                       563   35   33   12   6    3    9    8    0   14   683
Sum                           2843  850  605  333  346  192  207  233  77  516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                  BF Tree   IBK     SMO
Time to build model (in sec)         0.97      0.02    0.33
Correctly classified instances       652       655     657
Incorrectly classified instances     31        28      26
Accuracy (%)                         95.46     95.90   96.19
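As an illustrative analogue of the comparison above (an assumption, not the paper's Weka run: scikit-learn's SVC stands in for SMO, KNeighborsClassifier for IBK, and DecisionTreeClassifier for the BF Tree, on the library's bundled breast-cancer data):

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {"SMO-like SVM": SVC(kernel="linear"),
          "IBK-like KNN": KNeighborsClassifier(n_neighbors=1),
          "BF-Tree-like": DecisionTreeClassifier(random_state=0)}

# 10-fold cross-validated accuracy for each classifier.
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: {acc:.4f}")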
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
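A minimal sketch of the definitions above, plugging in SMO's counts from the confusion matrix reported below (malignant treated as the positive class; TP = 226, FN = 13, FP = 13, TN = 431):

TP, FN, FP, TN = 226, 13, 13, 431

sensitivity = TP / (TP + FN)            # true positive rate
specificity = TN / (TN + FP)            # true negative rate
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# 0.946 0.971 0.9619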
EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance rank
Clump Thickness               378.08158     0.464       0.152        126.23252    8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.67332    2
Marginal Adhesion             390.0595      0.464       0.210        130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.32190    4
Normal Nucleoli               416.63061     0.487       0.237        139.11820    6
Mitoses                       191.9682      0.212       0.212        64.122733    9
04072023AAST-Comp eng96
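An illustrative sketch of ranking attributes by chi-squared and information gain (mutual information) as in the table above; scikit-learn's bundled breast-cancer data stands in for the paper's 9-attribute Wisconsin set, so the scores are only indicative:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

chi2_scores, _ = chi2(X, y)                      # chi-squared statistic per feature
ig_scores = mutual_info_classif(X, y, random_state=0)  # information-gain-style score

# Rank features by their chi-squared score and show the top five.
ranking = sorted(zip(data.feature_names, chi2_scores, ig_scores),
                 key=lambda t: -t[1])
for name, c, ig in ranking[:5]:
    print(f"{name}: chi2={c:.1f}, info_gain={ig:.3f}")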
0407202397
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is the highest compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work: using an updated version of Weka; using another data mining tool; using alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.
04072023AAST-Comp eng99
Comparison: "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023 AAST-Comp eng 34
Classification Process (1) Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo
Classifier(Model)
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023 AAST-Comp eng 35
Classification Process (2) Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff Professor 4)
Tenured
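As a concrete illustration of this step, the short sketch below assumes that model construction (the previous step) produced the rule "IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'"; that rule, the field names and the use of Python are illustrative assumptions, not part of the original deck. The model is checked against the testing data and then applied to the unseen record.

```python
# Hypothetical classification rule standing in for the constructed model (an assumption
# for illustration only); applied first to the testing data, then to the unseen record.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

for name, rank, years, actual in testing_data:
    print(name, "predicted:", predict_tenured(rank, years), "actual:", actual)

# Unseen data: (Jeff, Professor, 4) -> predicted tenured = "yes"
print("Jeff predicted:", predict_tenured("Professor", 4))
```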
Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.
AAST-Comp eng 3604072023
Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data
AAST-Comp eng 3704072023
Quality of a classifier
• Quality will also be calculated with respect to the lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix summarizes the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified instances,
• and the off-diagonal elements represent misclassified instances.
AAST-Comp eng 3804072023
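A small sketch of how such a confusion matrix can be produced in practice; the toy labels and the use of scikit-learn (rather than the Weka tool discussed later) are assumptions for illustration. Note that scikit-learn's convention puts actual classes on the rows and predicted classes on the columns; conventions differ between tools.

```python
# Toy confusion matrix: diagonal entries are correctly classified instances,
# off-diagonal entries are misclassified ones.
from sklearn.metrics import confusion_matrix

y_actual    = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
y_predicted = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]

cm = confusion_matrix(y_actual, y_predicted, labels=["benign", "malignant"])
print(cm)   # [[2 1]
            #  [1 2]]
```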
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques (diagram): Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Tennis example (figure): a scatter plot of Humidity vs. Temperature, where each point is labelled either "play tennis" or "do not play tennis".
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
(Figure: a line representing the decision boundary ax + by − c = 0.)
45
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane 'far' from the data.
04072023 AAST-Comp eng 46
SVM – Support Vector Machines (figure): two separating hyperplanes with their support vectors, one with a small margin and one with a large margin.
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
(Figure: the support vectors, the margin-maximizing hyperplane, and a narrower-margin alternative.)
48
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
04072023 AAST-Comp eng 51
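A minimal sketch of the maximum-margin idea on toy 2-D data, using scikit-learn's SVC with a linear kernel as an illustrative stand-in (the paper's experiments use Weka's SMO implementation instead); the data points are invented for the example.

```python
# Linearly separable toy data: the fitted linear SVM exposes the hyperplane
# coefficients and the support vectors that define the maximal margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)      # hyperplane w·x + b = 0
print(clf.support_vectors_)           # the "difficult points" closest to the boundary
print(clf.predict([[3.0, 3.0]]))      # classify a new point
```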
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
 Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
 A new example is assigned to the most common class among the (K) examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K examples nearest to E in the training set.
 Assign E to the most common class among its K-nearest neighbors.
(Figure: points labelled "Response" / "No response"; the new example is classified as "Response".)
04072023 AAST-Comp eng 54
Distance Between Neighbors
 Each example is represented with a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples.
 The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as
   D(X, Y) = sqrt[ (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² ]
 John: Age = 35, Income = 95K, No. of credit cards = 3
 Rachel: Age = 41, Income = 215K, No. of credit cards = 2
 Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
04072023 AAST-Comp eng 55
Instance Based Learning
 No model is built: store all training examples.
 Any processing is delayed until a new instance must be classified.
(Figure: points labelled "Response" / "No response"; the new instance is classified as "Respond".)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors

Customer   Age   Income   No. of credit cards   Response
John       35    35K      3                     No
Rachel     22    50K      2                     Yes
Hannah     63    200K     1                     No
Tom        59    170K     1                     No
Nellie     25    40K      4                     Yes
David      37    50K      2                     ?

04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. of cards   Response   Distance from David
John       35    35           3              No         sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel     22    50           2              Yes        sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah     63    200          1              No         sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom        59    170          1              No         sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie     25    40           4              Yes        sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
David      37    50           2              Yes (the most common class among his 3 nearest neighbors)
04072023 AAST-Comp eng 58
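The same 3-nearest-neighbour vote for David, written as a short sketch with scikit-learn's KNeighborsClassifier (the paper's experiments use Weka's IBK, which implements the same idea); the data come from the table above.

```python
# 3-NN on the toy customer table: David's three nearest neighbours are
# Rachel (Yes), John (No) and Nellie (Yes), so the majority vote is "Yes".
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[35, 35, 3],    # John
              [22, 50, 2],    # Rachel
              [63, 200, 1],   # Hannah
              [59, 170, 1],   # Tom
              [25, 40, 4]])   # Nellie  (columns: age, income in K, no. of credit cards)
y = np.array(["No", "Yes", "No", "No", "Yes"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[37, 50, 2]]))   # David -> ['Yes']
```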
Strengths and Weaknesses
Strengths:
 Simple to implement and use.
 Comprehensible – easy to explain the prediction.
 Robust to noisy data by averaging over the k-nearest neighbors.
Weaknesses:
 Needs a lot of space to store all examples.
 Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
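A tiny induction sketch on assumed toy data: scikit-learn builds the tree by repeatedly splitting the training examples into smaller subsets, and export_text prints the learned tree as readable rules, matching the "set of sentences" reading above. The data and library choice are illustrative assumptions only.

```python
# Toy induction: the class is attr1 AND attr2, and the printed tree encodes that rule.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # two binary attributes
y = [0, 0, 0, 1]                        # class is 1 only when both attributes are 1

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["attr1", "attr2"]))
```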
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
– Small box office – $200,000
– Medium box office – $1,000,000
– Large box office – $3,000,000
• TV network payout:
– Flat rate – $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind – Payoff Table

                           States of Nature
Decisions                  Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior Probabilities        0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
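The expected-return arithmetic above, as a small worked check in plain Python; it uses only the payoff table and probabilities already given.

```python
# Expected value of each decision under the prior probabilities 0.3 / 0.6 / 0.1.
probs         = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoffs_tv    = {"small": 900_000, "medium": 900_000,   "large": 900_000}

ev_movie = sum(probs[s] * payoffs_movie[s] for s in probs)
ev_tv    = sum(probs[s] * payoffs_tv[s]    for s in probs)
print(ev_movie, ev_tv)   # 960000.0 900000.0 -> sign with the movie company
```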
Decision Trees
• Three types of "nodes":
– Decision nodes – represented by squares
– Chance nodes – represented by circles (Ο)
– Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree (figure): a decision node branching into Decision 1 and Decision 2; Decision 1 leads to a chance node with Event 1, Event 2 and Event 3.
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree (figure): a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with outcomes Small, Medium and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 in every case for the TV contract.
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (figure): the same tree with prior probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large Box Office branches, and an expected return (ER) to be computed at each chance node.
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree – Solved (figure): with probabilities 0.3 / 0.6 / 0.1, the movie-contract chance node has ER = $960,000 and the TV-network node has ER = $900,000, so the best decision at the root (ER = $960,000) is to sign with the movie company.
04072023 AAST-Comp eng 70
Results – performance evaluation cycle (figure): dataset → data preprocessing → feature selection → selection of the data mining tool → classification → performance evaluation.
Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       TP                     FN
Actual not healthy   FP                     TN
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average
04072023 AAST-Comp eng 73
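A minimal sketch of the same 10-fold protocol with scikit-learn. The paper runs this inside Weka; here the Wisconsin diagnostic dataset bundled with scikit-learn is used as a stand-in, which is related to, but not identical with, the 683-instance file used in the paper.

```python
# 10-fold cross-validation: split into 10 equal pieces, train on 9, test on the
# remaining one, repeat for every piece and average the accuracies.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())
```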
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race (white or black)
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachia et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and the other for patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Taken from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
 Class distribution of the original set: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: the 16 excluded instances were 2 malignant and 14 benign, so for the 683-instance set the distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute – Domain
Sample Code Number – Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
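A sketch of obtaining the same file and removing the 16 instances with missing values. The UCI URL and column names are assumptions based on the repository layout and the attribute table above, and pandas is used only for illustration (the paper works with the file inside Weka).

```python
# Load breast-cancer-wisconsin.data (699 rows), treat '?' as missing, keep 683 rows.
import pandas as pd

cols = ["id", "clump_thickness", "uniformity_cell_size", "uniformity_cell_shape",
        "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
df = pd.read_csv(url, names=cols, na_values="?").dropna()

print(len(df))                        # 683 instances
print(df["class"].value_counts())     # 2 = benign (444), 4 = malignant (239)
```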
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Number of instances taking each attribute value (domain 1–10):

Attribute                     1     2     3     4     5     6     7     8     9     10    Sum
Clump Thickness               139   50    104   79    128   33    23    44    14    69    683
Uniformity of Cell Size       373   45    52    38    30    25    19    28    6     67    683
Uniformity of Cell Shape      346   58    53    43    32    29    30    27    7     58    683
Marginal Adhesion             393   58    58    33    23    21    13    25    4     55    683
Single Epithelial Cell Size   44    376   71    48    39    40    11    21    2     31    683
Bare Nuclei                   402   30    28    19    30    4     8     21    9     132   683
Bland Chromatin               150   160   161   39    34    9     71    28    11    20    683
Normal Nucleoli               432   36    42    18    19    22    16    23    15    60    683
Mitoses                       563   35    33    12    6     3     9     8     0     14    683
Sum                           2843  850   605   333   346   192   207   233   77    516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                 BF Tree   IBK     SMO
Time to build model (in sec)        0.97      0.02    0.33
Correctly classified instances      652       655     657
Incorrectly classified instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
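A worked check of these formulas on the SMO confusion matrix reported below (431 / 13 / 13 / 226), taking "positive" to mean the benign class; plain Python arithmetic with no further assumptions.

```python
# TP, FN, FP, TN for SMO when benign is treated as the positive class.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                      # ≈ 0.971 (TPR for benign)
specificity = tn / (tn + fp)                      # ≈ 0.946 (TNR)
accuracy    = (tp + tn) / (tp + fp + tn + fn)     # ≈ 0.9619, i.e. 96.19%
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
```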
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.210        130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9
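A sketch of producing a chi-squared ranking of the nine attributes. It reuses the loading code shown earlier, and scikit-learn's chi2 scorer is only a stand-in for Weka's attribute evaluators, so the absolute numbers will differ from the table above even though the idea (rank attributes by their dependence on the class) is the same.

```python
# Rank the nine attributes by chi-squared score against the class label.
import pandas as pd
from sklearn.feature_selection import chi2

cols = ["id", "clump_thickness", "uniformity_cell_size", "uniformity_cell_shape",
        "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
df = pd.read_csv(url, names=cols, na_values="?").dropna()

features = cols[1:-1]
scores, _ = chi2(df[features], df["class"])
for name, score in sorted(zip(features, scores), key=lambda t: -t[1]):
    print(f"{name:30s} {score:8.1f}")
```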
04072023AAST-Comp eng96
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is the highest compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka.
 Using another data mining tool.
 Using alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper
 Spelling mistakes.
 No point of contact (e-mail).
 Wrong percentage calculation.
 Copying from old papers.
 Charts not clear.
 No contributions.
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Classificationbull is a data mining (machine learning) technique used to
predict group membership for data instances bull Classification analysis is the organization of data in
given classbull These approaches normally use a training set where
all objects are already associated with known class labels
bull The classification algorithm learns from the training set and builds a model
bull Many classification models are used to classify new objects
AAST-Comp eng 3604072023
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Classification
bull predicts categorical class labels (discrete or nominal)
bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data
AAST-Comp eng 3704072023
Quality of a classifierbull Quality will be calculated with respect to lowest
computing timebull Quality of certain model one can describe by confusion
matrix bull Confusion matrix shows a new entry properties
predictive ability of the method bull Row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class
bull Thus the diagonal elements represent correctly classified compounds
bull the cross-diagonal elements represent misclassified compounds
AAST-Comp eng 3804072023
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors
Customer   Age   Income   No. of credit cards   Response
John       35    35K      3                     No
Rachel     22    50K      2                     Yes
Hannah     63    200K     1                     No
Tom        59    170K     1                     No
Nellie     25    40K      4                     Yes
David      37    50K      2                     ?
04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.16
Rachel     22    50           2           Yes        sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
Hannah     63    200          1           No         sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] = 152.23
Tom        59    170          1           No         sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] = 122
Nellie     25    40           4           Yes        sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] = 15.74
David      37    50           2           Yes (predicted from the 3 nearest neighbors: Rachel, John, Nellie)
04072023 AAST-Comp eng 58
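The 3-NN prediction above can be reproduced in a few lines of code. The sketch below (plain Java, an illustrative helper class that is not part of the paper) computes the Euclidean distances from David and takes a majority vote among the three nearest customers:

```java
// Minimal sketch of the 3-NN worked example above (values taken from the slide).
// The class and variable names are illustrative only.
import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbors {

    // Euclidean distance between two attribute vectors
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        String[]   names    = {"John", "Rachel", "Hannah", "Tom", "Nellie"};
        double[][] points   = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        boolean[]  response = {false, true, false, false, true};   // No, Yes, No, No, Yes
        final double[] david = {37, 50, 2};

        // Sort the customers by distance from David
        Integer[] order = {0, 1, 2, 3, 4};
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(points[i], david)));

        // Majority vote among the 3 nearest (Rachel ~ 15.0, John ~ 15.2, Nellie ~ 15.7)
        int yes = 0;
        for (int k = 0; k < 3; k++) {
            int i = order[k];
            System.out.printf("%-7s distance = %6.2f  response = %s%n",
                    names[i], distance(points[i], david), response[i] ? "Yes" : "No");
            if (response[i]) yes++;
        }
        System.out.println("Predicted response for David: " + (yes >= 2 ? "Yes" : "No"));
    }
}
```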
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind – Payoff Table
                            States of Nature
Decisions                   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior Probabilities         0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000)
          = $960,000 = EV(UII) or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)
          = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
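The expected-value computation above is simple enough to check in code; this small sketch (illustrative Java, not from the paper) evaluates both chance nodes and picks the better decision:

```java
// Expected value of each decision in the Jenny Lind example (values from the slides).
public class JennyLindExpectedValue {
    public static void main(String[] args) {
        double[] probs       = {0.3, 0.6, 0.1};                 // small, medium, large box office
        double[] moviePayoff = {200_000, 1_000_000, 3_000_000};
        double[] tvPayoff    = {900_000, 900_000, 900_000};     // flat rate, independent of outcome

        double evMovie = 0, evTv = 0;
        for (int i = 0; i < probs.length; i++) {
            evMovie += probs[i] * moviePayoff[i];
            evTv    += probs[i] * tvPayoff[i];
        }
        System.out.printf("EV(movie company) = $%,.0f%n", evMovie);   // $960,000
        System.out.printf("EV(TV network)    = $%,.0f%n", evTv);      // $900,000
        System.out.println(evMovie > evTv ? "Sign with the movie company."
                                          : "Sign with the TV network.");
    }
}
```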
Decision Trees
• Three types of "nodes":
– Decision nodes – represented by squares
– Chance nodes – represented by circles (○)
– Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
(Figure: a decision node branching into Decision 1 and Decision 2; each decision leads to a chance node with Event 1, Event 2 and Event 3.)
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
(Figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small, Medium and Large Box Office outcomes paying $200,000 / $1,000,000 / $3,000,000 and $900,000 / $900,000 / $900,000 respectively.)
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
(Figure: the same tree with the prior probabilities 0.3, 0.6 and 0.1 attached to the branches of each chance node; the expected return ER of each node is still to be computed.)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree – Solved
(Figure: with probabilities 0.3, 0.6 and 0.1, the movie-contract chance node has ER = $960,000 and the TV-network branch has ER = $900,000; the decision node therefore takes the movie branch, ER = $960,000.)
04072023 AAST-Comp eng 70
Results
(Figure: the experimental cycle – dataset → data preprocessing → feature selection → selection of the data mining tool → classification → performance evaluation.)
Evaluation Metrics
                      Predicted as healthy   Predicted as unhealthy
Actual healthy        tp                     fn
Actual not healthy    fp                     tn
AAST-Comp eng 72 04072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces.
– Train on 9 pieces and test on the remainder.
– Do this for all possibilities and average.
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.
Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed, but sometimes grow back.
AAST-Comp eng
04072023
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
78
0407202379
BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
• Bellaachia et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the Logistic classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine Machine Learning Repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                      Domain
Sample Code Number             Id number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
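As a sketch of how the comparison described in this paper could be reproduced with the Weka Java API (class names as shipped with Weka 3.6.x; the ARFF file name is an assumption, and BFTree was moved out of the core distribution in later releases):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BreastCancerComparison {
    public static void main(String[] args) throws Exception {
        // Load the 683-instance breast-cancer-wisconsin data (file name assumed)
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);        // last attribute = class (benign/malignant)

        Classifier[] classifiers = {new BFTree(), new IBk(), new SMO()};
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            System.out.printf("%-8s correct = %.0f  accuracy = %.2f%%%n",
                    c.getClass().getSimpleName(), eval.correct(), eval.pctCorrect());
        }
    }
}
```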
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
Importance of the input variables (distribution of attribute values)
04072023 AAST-Comp eng 90
Domain                         1     2     3     4     5     6     7     8     9    10   Sum
Clump Thickness              139    50   104    79   128    33    23    44    14    69   683
Uniformity of Cell Size      373    45    52    38    30    25    19    28     6    67   683
Uniformity of Cell Shape     346    58    53    43    32    29    30    27     7    58   683
Marginal Adhesion            393    58    58    33    23    21    13    25     4    55   683
Single Epithelial Cell Size   44   376    71    48    39    40    11    21     2    31   683
Bare Nuclei                  402    30    28    19    30     4     8    21     9   132   683
Bland Chromatin              150   160   161    39    34     9    71    28    11    20   683
Normal Nucleoli              432    36    42    18    19    22    16    23    15    60   683
Mitoses                      563    35    33    12     6     3     9     8     0    14   683
Sum                         2843   850   605   333   346   192   207   233    77   516
EXPERIMENTAL RESULTS
91 04072023 AAST-Comp eng
Evaluation Criteria                  BF Tree   IBK    SMO
Time to build model (in sec)         0.97      0.02   0.33
Correctly classified instances       652       655    657
Incorrectly classified instances     31        28     26
Accuracy (%)                         95.46     95.90  96.19
EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
93 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Confusion matrices (rows = actual class, columns = predicted class)
Classifier   Benign   Malignant   Actual Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
94 04072023 AAST-Comp eng
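Plugging the SMO confusion matrix above into the definitions from the previous slide (treating "malignant" as the positive class) reproduces the reported 96.19% accuracy; a minimal sketch of that check:

```java
// Sensitivity, specificity and accuracy from the SMO confusion matrix above:
// TP = 226 (malignant predicted malignant), FN = 13, FP = 13, TN = 431.
public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        double tp = 226, fn = 13, fp = 13, tn = 431;

        double sensitivity = tp / (tp + fn);                    // true positive rate ~ 0.946
        double specificity = tn / (tn + fp);                    // true negative rate ~ 0.971
        double accuracy    = (tp + tn) / (tp + tn + fp + fn);   // 657 / 683 ~ 0.9619

        System.out.printf("Sensitivity = %.3f%n", sensitivity);
        System.out.printf("Specificity = %.3f%n", specificity);
        System.out.printf("Accuracy    = %.2f%%%n", 100 * accuracy);
    }
}
```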
Importance of the input variables
04072023 AAST-Comp eng 95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average       Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526    8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445      7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726    5
Bare Nuclei                   489.00953     0.603       0.303        163.305176    3
Bland Chromatin               453.20971     0.555       0.201        151.321903    4
Normal Nucleoli               416.63061     0.487       0.237        139.118203    6
Mitoses                       191.9682      0.212       0.212        64.122733     9
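A sketch of how the chi-squared, information-gain and gain-ratio scores in this table could be produced with Weka's attribute-selection API (class names as in Weka 3.6.x; the ARFF file name is again assumed):

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeRanking {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        ASEvaluation[] evaluators = {
                new ChiSquaredAttributeEval(), new InfoGainAttributeEval(), new GainRatioAttributeEval()};
        for (ASEvaluation evaluator : evaluators) {
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(evaluator);
            selector.setSearch(new Ranker());          // rank all attributes by their score
            selector.SelectAttributes(data);
            System.out.println(evaluator.getClass().getSimpleName());
            System.out.println(selector.toResultsString());
        }
    }
}
```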
04072023AAST-Comp eng96
0407202397
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.
04072023AAST-Comp eng99
Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011), Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), An Empirical Comparison of Data Mining Classification Methods, International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets, 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines, Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates, Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L., Nuclear feature extraction for breast tumor diagnosis, Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), Feature Selection and Classification using Flexible Neural Tree, Journal of Neurocomputing 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data
04072023AAST-Comp eng39
Classification Techniques
classification
Techniques
Naiumlve Bays
SVM
C45
KNN
BF tree
IBK
40 04072023AAST-Comp eng
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict survivability for heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the efficiency of the logistic classification function is better than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography-1 and cardiotocography-2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                      Domain
Sample Code Number             Id number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
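As a rough sketch of the preprocessing described above (removing the 16 records that contain missing values before the experiments), something along these lines could be done with the Weka API; the file name and the assumption that missing entries are marked as missing in the ARFF file are illustrative, not taken from the paper.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrepareDataset {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the 699-instance UCI file.
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Before cleaning: " + data.numInstances()); // expected 699

        // Drop every instance that has a missing value in any attribute.
        for (int att = 0; att < data.numAttributes(); att++) {
            data.deleteWithMissing(att);
        }
        System.out.println("After cleaning: " + data.numInstances());  // expected 683

        // Class distribution; the order follows the class attribute's declared values.
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        System.out.println("Class counts: " + counts[0] + " / " + counts[1]);
    }
}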
0407202387
EVALUATION METHODS We have used Weka (Waikato Environment for
Knowledge Analysis) version 3.6.9. WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain value                   1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                  BF Tree     IBK     SMO
Time to build model (sec)               0.97    0.02    0.33
Correctly classified instances           652     655     657
Incorrectly classified instances          31      28      26
Accuracy (%)                           95.46   95.90   96.19
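The comparison in the table could be reproduced, at least in spirit, with a sketch like the following; class names are those of Weka 3.6 (BFTree was still bundled in that release), the dataset path is assumed, and the build timings will of course differ by machine.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.BFTree;
import weka.classifiers.lazy.IBk;
import weka.classifiers.functions.SMO;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
        for (Classifier cls : classifiers) {
            long start = System.currentTimeMillis();
            cls.buildClassifier(data);                              // time to build the model
            long buildMs = System.currentTimeMillis() - start;

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(cls, data, 10, new Random(1));  // 10-fold cross-validation
            System.out.printf("%-10s build: %d ms  accuracy: %.2f%%  correct: %.0f  incorrect: %.0f%n",
                    cls.getClass().getSimpleName(), buildMs,
                    eval.pctCorrect(), eval.correct(), eval.incorrect());
        }
    }
}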
EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN);
 the specificity, or true negative rate (TNR), is defined by TN / (TN + FP);
 the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
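Plugging the SMO confusion matrix reported in the tables below into these definitions (here taking malignant as the positive class, which is an assumption of this sketch rather than a statement from the paper) reproduces the figures in the accuracy table:

public class Metrics {
    public static void main(String[] args) {
        // SMO confusion matrix from the experiments, with malignant as the positive class:
        // 226 malignant correct, 13 malignant missed, 13 benign flagged as malignant, 431 benign correct.
        double tp = 226, fn = 13, fp = 13, tn = 431;

        double sensitivity = tp / (tp + fn);                  // true positive rate
        double specificity = tn / (tn + fp);                  // true negative rate
        double accuracy    = (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("Sensitivity: %.3f%n", sensitivity); // ~0.946
        System.out.printf("Specificity: %.3f%n", specificity); // ~0.971
        System.out.printf("Accuracy:    %.4f%n", accuracy);    // ~0.9619
    }
}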
EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree        0.971     0.075       0.960    0.971   Benign
               0.925     0.029       0.944    0.925   Malignant
IBK            0.980     0.079       0.958    0.980   Benign
               0.921     0.020       0.961    0.921   Malignant
SMO            0.971     0.054       0.971    0.971   Benign
               0.946     0.029       0.946    0.946   Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree                   431                    13   Benign
                           18                   221   Malignant
IBK                       435                     9   Benign
                           19                   220   Malignant
SMO                       431                    13   Benign
                           13                   226   Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                 378.08158       0.464        0.152     126.232526            8
Uniformity of Cell Size         539.79308       0.702        0.300     180.265026            1
Uniformity of Cell Shape        523.07097       0.677        0.272     174.673323            2
Marginal Adhesion               390.05950       0.464        0.210     130.244500            7
Single Epithelial Cell Size     447.86118       0.534        0.233     149.542726            5
Bare Nuclei                     489.00953       0.603        0.303     163.305176            3
Bland Chromatin                 453.20971       0.555        0.201     151.321903            4
Normal Nucleoli                 416.63061       0.487        0.237     139.118203            6
Mitoses                         191.96820       0.212        0.212      64.122733            9
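Rankings of this kind can be produced with Weka's attribute-selection classes; a hedged sketch follows (class names per Weka 3.6, dataset path assumed; the exact scores above come from the paper, not from running this code).

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        ASEvaluation[] evaluators = {
            new ChiSquaredAttributeEval(), new InfoGainAttributeEval(), new GainRatioAttributeEval()
        };
        for (ASEvaluation evaluator : evaluators) {
            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(evaluator);
            selection.setSearch(new Ranker());   // rank every attribute by the evaluator's score
            selection.SelectAttributes(data);

            System.out.println(evaluator.getClass().getSimpleName());
            for (double[] ranked : selection.rankedAttributes()) {
                // ranked[0] = attribute index, ranked[1] = score
                System.out.printf("  %-28s %.4f%n", data.attribute((int) ranked[0]).name(), ranked[1]);
            }
        }
    }
}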
04072023AAST-Comp eng96
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is higher than that of the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
Classification Techniques
 Naïve Bayes
 SVM
 C4.5
 KNN
 BF Tree
 IBK
40 04072023AAST-Comp eng
Classification Model: Support Vector Machine Classifier (V. Vapnik)
04072023 AAST-Comp eng 41
Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
Tennis example
(Figure: examples plotted by Humidity and Temperature, marked as "play tennis" or "do not play tennis".)
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
45
This line represents the decision boundary: ax + by − c = 0
04072023 AAST-Comp eng
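To illustrate, classifying a point with a linear boundary ax + by − c = 0 only requires checking which side of the line it falls on; the coefficients and labels in this tiny sketch are made up for the tennis example and are not from the slides.

public class LinearBoundary {
    // Decision boundary ax + by - c = 0; coefficients here are arbitrary placeholders.
    static final double A = 0.8, B = 1.2, C = 10.0;

    static String classify(double humidity, double temperature) {
        double score = A * humidity + B * temperature - C;
        return score >= 0 ? "play tennis" : "do not play tennis";
    }

    public static void main(String[] args) {
        System.out.println(classify(5.0, 9.0));  // falls on the positive side of the line
        System.out.println(classify(2.0, 1.0));  // falls on the negative side of the line
    }
}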
Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
48
(Figure: the support vectors define the maximized margin; a narrower margin gives a less robust separator.)
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
04072023 AAST-Comp eng 51
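Since the slides note that kernels let an SVM learn very complex functions, here is a hedged sketch of swapping the kernel on Weka's SMO implementation; the RBF kernel, its gamma value, and the dataset path are illustrative choices, not the paper's configuration.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;

public class KernelSmo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();                 // SVM trained by Sequential Minimal Optimization
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);               // illustrative kernel width
        smo.setKernel(kernel);               // non-linear decision boundary via the kernel trick

        smo.buildClassifier(data);
        System.out.println(smo);             // prints the learned support-vector model
    }
}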
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the (K) examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K nearest examples to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.
(Figure: neighborhood of the new point, with neighbors labeled "Response" / "No response"; the new point is assigned class "Response".)
04072023 AAST-Comp eng 54
Distance Between Neighbors
Each example is represented by a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as
D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):
Distance(John, Rachel) = sqrt[(35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2]
04072023 AAST-Comp eng 55
Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
(Figure: stored examples labeled "Response" / "No response"; the new instance is assigned class "Response".)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors
Customer   Age   Income   No. credit cards   Response
John        35      35K                  3   No
Rachel      22      50K                  2   Yes
Hannah      63     200K                  1   No
Tom         59     170K                  1   No
Nellie      25      40K                  4   Yes
David       37      50K                  2   ?
04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. cards   Response   Distance from David
John        35           35           3   No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel      22           50           2   Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah      63          200           1   No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom         59          170           1   No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie      25           40           4   Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David       37           50           2   ?          three nearest neighbors: Rachel (Yes), John (No), Nellie (Yes) → predicted Yes
04072023 AAST-Comp eng 58
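A small, self-contained sketch of the 3-nearest-neighbour prediction for David, using exactly the numbers from the table above (attributes are age, income in thousands, and number of credit cards); the class and method names are of course illustrative.

import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbors {
    static class Customer {
        final String name, response;
        final double age, incomeK, cards;
        Customer(String name, double age, double incomeK, double cards, String response) {
            this.name = name; this.age = age; this.incomeK = incomeK;
            this.cards = cards; this.response = response;
        }
    }

    static double distance(Customer a, Customer b) {
        return Math.sqrt(Math.pow(a.age - b.age, 2)
                + Math.pow(a.incomeK - b.incomeK, 2)
                + Math.pow(a.cards - b.cards, 2));
    }

    public static void main(String[] args) {
        Customer[] training = {
            new Customer("John",   35,  35, 3, "No"),
            new Customer("Rachel", 22,  50, 2, "Yes"),
            new Customer("Hannah", 63, 200, 1, "No"),
            new Customer("Tom",    59, 170, 1, "No"),
            new Customer("Nellie", 25,  40, 4, "Yes"),
        };
        final Customer david = new Customer("David", 37, 50, 2, "?");

        // Sort the training examples by Euclidean distance to David, then vote over the 3 closest.
        Customer[] byDistance = training.clone();
        Arrays.sort(byDistance, Comparator.comparingDouble(c -> distance(c, david)));

        int yes = 0;
        for (int i = 0; i < 3; i++) {
            System.out.printf("%-7s distance %6.2f  response %s%n",
                    byDistance[i].name, distance(byDistance[i], david), byDistance[i].response);
            if (byDistance[i].response.equals("Yes")) yes++;
        }
        // Expected neighbors: Rachel (15.00), John (15.16), Nellie (15.74) -> 2 of 3 say Yes.
        System.out.println("Predicted response for David: " + (yes >= 2 ? "Yes" : "No"));
    }
}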
Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decision                      Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company             $200,000          $1,000,000         $3,000,000
Sign with TV Network                $900,000            $900,000           $900,000
Prior probabilities                      0.3                 0.6                0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(TV)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
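The same expected-value arithmetic in a few lines of Java, with the payoffs and probabilities taken straight from the slides (the class name is just for illustration):

public class ExpectedReturn {
    public static void main(String[] args) {
        double[] probabilities = { 0.3, 0.6, 0.1 };                 // small, medium, large box office
        double[] moviePayoffs  = { 200_000, 1_000_000, 3_000_000 };
        double[] tvPayoffs     = { 900_000, 900_000, 900_000 };

        double evMovie = 0, evTv = 0;
        for (int i = 0; i < probabilities.length; i++) {
            evMovie += probabilities[i] * moviePayoffs[i];
            evTv    += probabilities[i] * tvPayoffs[i];
        }
        System.out.printf("EV(movie) = $%,.0f%n", evMovie);   // $960,000
        System.out.printf("EV(TV)    = $%,.0f%n", evTv);      // $900,000
        System.out.println("Best decision: " + (evMovie > evTv ? "sign with the movie company"
                                                               : "sign with the TV network"));
    }
}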
Decision Trees
• Three types of "nodes":
  – Decision nodes, represented by squares (□)
  – Chance nodes, represented by circles (○)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
(Branch probabilities: 0.3, 0.6, 0.1 for the movie option; 0.3, 0.6, 0.1 for the TV option. ER to be computed at each chance node.)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
(Branch probabilities: 0.3, 0.6, 0.1 for the movie option; 0.3, 0.6, 0.1 for the TV option.)
ER(TV) = $900,000
ER(movie) = $960,000
Best expected return: $960,000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Classification ModelSupport vector machine
Classifier
V Vapnik
04072023 AAST-Comp eng 41
Support Vector Machine (SVM) SVM is a state-of-the-art learning machine
which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
04072023AAST-Comp eng42
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive & descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification: A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine that has been used extensively as a tool for data classification, function approximation, etc.,
owing to its generalization ability, and it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.
04072023AAST-Comp eng42
Tennis example
(Figure: scatter plot of Humidity vs. Temperature; points marked as "play tennis" or "do not play tennis")
04072023 AAST-Comp eng 44
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
45
(Figure: this line represents the decision boundary: ax + by - c = 0)
Ch. 15
04072023 AAST-Comp eng
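To make the decision rule concrete, the sketch below (illustrative only; the weights, bias and test point are made-up values, not taken from the slides) classifies a point by the sign of ax + by - c:

```java
// Minimal sketch of a linear decision function f(x) = w·x + b (the ax + by - c = 0 boundary above).
// Weights, bias and the test point are made-up values, only to illustrate the sign rule.
public class LinearClassifier {
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) score += w[i] * x[i];
        return score >= 0 ? 1 : -1;            // 1 = "play tennis", -1 = "do not play tennis"
    }

    public static void main(String[] args) {
        double[] w = {0.8, -0.5};              // hypothetical weights for (temperature, humidity)
        double b = -0.1;                       // hypothetical bias (the -c term)
        System.out.println(classify(w, b, new double[]{0.7, 0.4}));   // prints 1
    }
}
```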
Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane 'far' from the data.
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
(Figure: support vectors shown for a small-margin vs. a large-margin separator)
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
48
(Figure: support vectors marked on the maximized-margin separator, contrasted with a narrower margin. Sec. 15.1)
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM
- Relatively new concept.
- Nice generalization properties.
- Hard to learn: learned in batch mode using quadratic programming techniques.
- Using kernels, can learn very complex functions.
04072023 AAST-Comp eng 51
Classification Model: K-Nearest Neighbor Classifier
04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the (K) examples that are most similar to it.
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm
To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the K nearest examples to E in the training set.
- Assign E to the most common class among its K nearest neighbors.
(Figure: labeled points "Response" / "No response"; the new example is assigned Class = Response)
04072023 AAST-Comp eng 54
Distance Between Neighbors
Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

D(X, Y) = sqrt( sum over i of (xi - yi)^2 )

Example:
John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2
Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95 - 215)^2 + (3 - 2)^2 ]
04072023 AAST-Comp eng 55
Instance Based Learning
- No model is built: store all training examples.
- Any processing is delayed until a new instance must be classified.
(Figure: labeled points "Response" / "No response"; the new example is assigned Class = Respond)
04072023 AAST-Comp eng 56
Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3 | No
Rachel   | 22  | 50K    | 2 | Yes
Hannah   | 63  | 200K   | 1 | No
Tom      | 59  | 170K   | 1 | No
Nellie   | 25  | 40K    | 4 | Yes
David    | 37  | 50K    | 2 | ?
04072023 AAST-Comp eng 57

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35  | 3 | No  | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel   | 22  | 50  | 2 | Yes | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah   | 63  | 200 | 1 | No  | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom      | 59  | 170 | 1 | No  | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie   | 25  | 40  | 4 | Yes | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David    | 37  | 50  | 2 | predicted: Yes (3 nearest neighbors are Rachel, John, Nellie)
04072023 AAST-Comp eng 58
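The 3-NN vote for David can be sketched in a few lines of Java; this is an illustrative re-implementation of the toy example above, not code from the paper:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Illustrative re-implementation of the 3-NN vote for "David" from the toy table above.
public class ThreeNearestNeighbors {
    static double dist(double[] a, double[] b) {             // Euclidean distance
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] train = {{35, 35, 3}, {22, 50, 2}, {63, 200, 1}, {59, 170, 1}, {25, 40, 4}};
        String[] labels = {"No", "Yes", "No", "No", "Yes"};   // Response column
        double[] david = {37, 50, 2};

        Integer[] idx = {0, 1, 2, 3, 4};
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], david)));

        Map<String, Integer> votes = new HashMap<>();
        for (int k = 0; k < 3; k++) votes.merge(labels[idx[k]], 1, Integer::sum);
        System.out.println(votes);                            // {No=1, Yes=2} -> predict "Yes"
    }
}
```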
Strengths and Weaknesses
Strengths:
- Simple to implement and use.
- Comprehensible: easy to explain the prediction.
- Robust to noisy data by averaging the k-nearest neighbors.
Weaknesses:
- Needs a lot of space to store all examples.
- Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
- Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
- The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table

Decision (states of nature)   | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company       | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network          | $900,000         | $900,000          | $900,000
Prior probabilities           | 0.3              | 0.6               | 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), i.e. EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
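The same expected-value arithmetic, written as a minimal Java sketch (illustrative, not part of the paper):

```java
// Minimal sketch of the expected-value comparison above.
public class ExpectedValue {
    static double ev(double[] probs, double[] payoffs) {
        double v = 0;
        for (int i = 0; i < probs.length; i++) v += probs[i] * payoffs[i];
        return v;
    }

    public static void main(String[] args) {
        double[] p = {0.3, 0.6, 0.1};   // small, medium, large box office
        System.out.println(ev(p, new double[]{200_000, 1_000_000, 3_000_000}));  // 960000.0 (movie)
        System.out.println(ev(p, new double[]{900_000, 900_000, 900_000}));      // 900000.0 (TV)
    }
}
```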
Decision Trees
• Three types of "nodes":
  - Decision nodes - represented by squares
  - Chance nodes - represented by circles
  - Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree
(Figure: a square decision node with branches "Decision 1" and "Decision 2", and a circular chance node with branches "Event 1", "Event 2" and "Event 3")
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
(Figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes Small / Medium / Large Box Office paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to a chance node with the same outcomes, each paying $900,000.)
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
(Figure: the same tree with probabilities 0.3 / 0.6 / 0.1 attached to the Small / Medium / Large Box Office branches at each chance node; the expected returns ER are still to be computed.)
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
(Figure: the solved tree. With probabilities 0.3 / 0.6 / 0.1, the movie-contract chance node has ER = $960,000 and the TV-network chance node has ER = $900,000, so the movie branch is kept.)
04072023 AAST-Comp eng 70
Results
(Figure: performance evaluation cycle - dataset, data preprocessing, feature selection, data mining tool selection, classification, performance evaluation)
Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
04072023 AAST-Comp eng 73
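A minimal sketch of those three steps with Weka's Java API; the ARFF file name and the choice of SMO as the classifier are assumptions for illustration, not values stated in the slides:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Manual 10-fold cross-validation: split into 10 pieces, train on 9, test on the remainder, repeat.
public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");       // assumed ARFF file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int folds = 10, correct = 0;
        for (int i = 0; i < folds; i++) {
            Instances train = data.trainCV(folds, i);        // 9 of the 10 pieces
            Instances test = data.testCV(folds, i);          // the held-out piece
            Classifier c = new SMO();                        // SMO is just an example choice here
            c.buildClassifier(train);
            for (int j = 0; j < test.numInstances(); j++)
                if (c.classifyInstance(test.instance(j)) == test.instance(j).classValue()) correct++;
        }
        System.out.println("Accuracy: " + 100.0 * correct / data.numInstances() + "%");
    }
}
```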
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
The aim of this paper is to investigate the performance of different classification techniques.
The goal is to develop accurate prediction models for breast cancer using data mining techniques.
Three classification techniques are compared in the Weka software and the comparison results are reported.
Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender; age; genetic risk factors; family history; personal history of breast cancer; race (white or black); dense breast tissue (denser breast tissue carries a higher risk); certain benign (not cancer) breast problems; lobular carcinoma in situ; menstrual periods.
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life; treatment with the drug DES (diethylstilbestrol) during pregnancy; not having children, or having them later in life; certain kinds of birth control; using hormone therapy after menopause; not breastfeeding; alcohol; being overweight or obese.
78
0407202379
BACKGROUND
Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are classification accuracy and error rate. The results show that the efficiency of the logistic classification function is better than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was tested on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.
2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.
We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; for the 683-instance dataset the distribution is benign 444 (65%) and malignant 239 (35%).
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute                   | Domain
Sample Code Number          | Id number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for benign, 4 for malignant
0407202387
EVALUATION METHODS We have used Weka (Waikato Environment for
Knowledge Analysis), version 3.6.9. WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
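A sketch of how the SMO / IBK / BF Tree comparison can be reproduced with Weka's Java API; the ARFF file name, the IBk neighbour count and the random seed are assumptions for illustration, not values stated in the paper:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Illustrative SMO / IBk / BFTree comparison with 10-fold cross-validation in Weka.
public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // assumed file name
        data.setClassIndex(data.numAttributes() - 1);                      // class is the last attribute

        Classifier[] classifiers = {new SMO(), new IBk(3), new BFTree()};  // k = 3 for IBk is an assumption
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));           // 10-fold cross-validation
            System.out.printf("%s accuracy = %.2f%%%n", c.getClass().getSimpleName(), eval.pctCorrect());
            System.out.println(eval.toMatrixString());                     // confusion matrix
        }
    }
}
```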
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023 AAST-Comp eng 90

Attribute (count per value)  | 1   | 2   | 3   | 4  | 5   | 6  | 7  | 8  | 9  | 10  | Sum
Clump Thickness              | 139 | 50  | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69  | 683
Uniformity of Cell Size      | 373 | 45  | 52  | 38 | 30  | 25 | 19 | 28 | 6  | 67  | 683
Uniformity of Cell Shape     | 346 | 58  | 53  | 43 | 32  | 29 | 30 | 27 | 7  | 58  | 683
Marginal Adhesion            | 393 | 58  | 58  | 33 | 23  | 21 | 13 | 25 | 4  | 55  | 683
Single Epithelial Cell Size  | 44  | 376 | 71  | 48 | 39  | 40 | 11 | 21 | 2  | 31  | 683
Bare Nuclei                  | 402 | 30  | 28  | 19 | 30  | 4  | 8  | 21 | 9  | 132 | 683
Bland Chromatin              | 150 | 160 | 161 | 39 | 34  | 9  | 71 | 28 | 11 | 20  | 683
Normal Nucleoli              | 432 | 36  | 42  | 18 | 19  | 22 | 16 | 23 | 15 | 60  | 683
Mitoses                      | 563 | 35  | 33  | 12 | 6   | 3  | 9  | 8  | 0  | 14  | 683
Sum                          | 2843| 850 | 605 | 333| 346 | 192| 207| 233| 77 | 516 |
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN);
the specificity, or true negative rate (TNR), is defined by TN / (TN + FP);
the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
IBK        | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant
93 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier | Classified as Benign | Classified as Malignant | Actual Class
BF Tree    | 431                  | 13                      | Benign
BF Tree    | 18                   | 221                     | Malignant
IBK        | 435                  | 9                       | Benign
IBK        | 19                   | 220                     | Malignant
SMO        | 431                  | 13                      | Benign
SMO        | 13                   | 226                     | Malignant
94 04072023 AAST-Comp eng
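Reading the SMO confusion matrix above with benign as the positive class gives TP = 431, FN = 13, FP = 13 and TN = 226, so the figures reported earlier follow directly: sensitivity = 431 / (431 + 13) ≈ 0.971, specificity = 226 / (226 + 13) ≈ 0.946, and accuracy = (431 + 226) / 683 ≈ 0.9619, i.e. 96.19%.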
importance of the input variables
04072023 AAST-Comp eng 95

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9
04072023AAST-Comp eng96
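Rankings of this kind can be produced with Weka's attribute evaluators; the sketch below (illustrative; the ARFF file name is an assumption) prints an information-gain ranking, and ChiSquaredAttributeEval or GainRatioAttributeEval can be substituted the same way for the other two columns:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Illustrative attribute ranking by information gain with Weka's Ranker search.
public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        double[][] ranked = selector.rankedAttributes();                   // [attribute index, score] pairs
        for (double[] r : ranked)
            System.out.printf("%-28s %.3f%n", data.attribute((int) r[0]).name(), r[1]);
    }
}
```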
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Support Vector Machine (SVM)
04072023AAST-Comp eng43
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc
due to its generalization ability and has found a great deal of success in many applications
Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Tennis example
Humidity
Temperature
= play tennis= do not play tennis
04072023 AAST-Comp eng 44
Linear classifiers Which Hyperplane
bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane
but not the optimal one bull Support Vector Machine (SVM) finds an
optimal solutionndash Maximizes the distance between the
hyperplane and the ldquodifficult pointsrdquo close to decision boundary
ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions
45
This line represents the
decision boundary
ax + by minus c = 0
Ch 15
04072023 AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
[Figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small, Medium and Large Box Office branches, ending at payoffs of $200,000, $1,000,000 and $3,000,000 for the movie contract and $900,000 for every TV-network outcome]
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
[Figure: the same tree with the prior probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large Box Office branches; the expected returns (ER) at the chance nodes are still to be computed]
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
[Figure: the solved tree; the movie-contract chance node has ER = $960,000 and the TV-network chance node has ER = $900,000, so the decision node takes the value ER = $960,000 and the movie contract is selected]
04072023 AAST-Comp eng 70
Results
[Figure: the experimental cycle - dataset, data preprocessing, feature selection, selection of the data mining tool, classification, and performance evaluation]
Evaluation Metrics
                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn
AAST-Comp eng 7204072023
Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction
 Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes, such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese
78
0407202379
BACKGROUND
 Bittern et al used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND
 Bellaachi et al used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and the other for patients who died before 5 years.
 Vikas Chaurasia et al used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND
 Vikas Chaurasia et al used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
 Dr. S. Vijayarani et al analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are classification accuracy and error rate. The results show that the Logistics classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.
 B.S. Harish et al presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 From the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, hence that percentage is wrong; the right one is Benign: 444 (65%) and Malignant: 239 (35%).
04072023AAST-Comp eng85
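A minimal sketch, assuming a local ARFF copy of the dataset (the file name is a placeholder), of how the 16 instances with missing values could be dropped with Weka's Java API, going from 699 to 683 instances as described above.

```java
// Hedged sketch: dropping instances with missing values (699 -> 683).
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RemoveMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Delete every instance that has at least one missing attribute value.
        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (data.instance(i).hasMissingValue()) {
                data.delete(i);
            }
        }
        System.out.println("Instances after removal: " + data.numInstances());  // expected 683
    }
}
```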
04072023 AAST-Comp eng 86
Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for Benign, 4 for Malignant
0407202387
EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
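A hedged sketch of the kind of comparison described in the paper, calling Weka from Java with the default 10-fold cross-validation. The ARFF path is a placeholder, and J48 is used here only as a widely available stand-in tree learner; substitute weka.classifiers.trees.BFTree if it is present in your Weka build.

```java
// Hedged sketch: evaluating SMO, IBk (k-nearest neighbours) and a decision tree
// with 10-fold cross-validation via WEKA's Java API.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new SMO(), new IBk(3), new J48() };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            System.out.printf("%s: %.2f%% correctly classified%n",
                              c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```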
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
(Number of instances taking each value 1-10 for every attribute.)

Domain                      | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139  | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373  | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346  | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393  | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44   | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402  | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150  | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432  | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563  | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria              | BF Tree | IBK   | SMO
Time to Build Model (in sec)     | 0.97    | 0.02  | 0.33
Correctly Classified Instances   | 652     | 655   | 657
Incorrectly Classified Instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
EXPERIMENTAL RESULTS
 The sensitivity, or the true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or the true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
           | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
           | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
           | 0.946   | 0.029   | 0.946     | 0.946  | Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree    | 431              | 13                  | Benign
           | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
           | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
           | 13               | 226                 | Malignant
94 04072023AAST-Comp eng
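To make the definitions above concrete, this small Java sketch (illustrative, not from the paper) recomputes sensitivity, specificity and accuracy from the SMO confusion matrix, with benign taken as the positive class; it reproduces the reported 96.19% accuracy.

```java
// Hedged sketch: metrics from the SMO confusion matrix above (benign = positive class).
public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        int tp = 431;   // benign correctly predicted as benign
        int fn = 13;    // benign wrongly predicted as malignant
        int fp = 13;    // malignant wrongly predicted as benign
        int tn = 226;   // malignant correctly predicted as malignant

        double sensitivity = (double) tp / (tp + fn);               // TPR
        double specificity = (double) tn / (tn + fp);               // TNR
        double accuracy    = (double) (tp + tn) / (tp + fn + fp + tn);

        System.out.printf("sensitivity = %.4f%n", sensitivity);     // 0.9707
        System.out.printf("specificity = %.4f%n", specificity);     // 0.9456
        System.out.printf("accuracy    = %.4f%n", accuracy);        // 0.9619, i.e. 96.19%
    }
}
```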
importance of the input variables
04072023AAST-Comp eng95
Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9
04072023AAST-Comp eng96
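A hedged sketch of how such a ranking could be produced with Weka's attribute selection API; the evaluator can be swapped for the chi-squared or gain-ratio evaluators, and the ARFF path is a placeholder.

```java
// Hedged sketch: ranking the attributes by information gain with WEKA, similar in spirit
// to the chi-squared / info gain / gain ratio ranking shown above.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());   // or ChiSquaredAttributeEval,
        selector.setSearch(new Ranker());                     // or GainRatioAttributeEval
        selector.SelectAttributes(data);

        // Ranked attribute indices (the class attribute index may appear last).
        for (int index : selector.selectedAttributes()) {
            System.out.println(data.attribute(index).name());
        }
    }
}
```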
0407202397
CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques
AAST-Comp eng
Notes on the paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
04072023AAST-Comp eng99
Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.
04072023
04072023105
Thank you
AAST-Comp eng
Selection of a Good Hyper-Plane
Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data
04072023 AAST-Comp eng 46
SVM ndash Support Vector Machines
Support VectorsSmall Margin Large Margin
04072023 AAST-Comp eng 47
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), pp. 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory. 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Support Vector Machine (SVM)
bull SVMs maximize the margin around the separating hyperplane
bull The decision function is fully specified by a subset of training samples the support vectors
bull Solving SVMs is a quadratic programming problem
bull Seen by many as the most successful current text classification method
48
Support vectors
Maximizesmargin
Sec 151
Narrowermargin
04072023 AAST-Comp eng
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Non-Separable Case
04072023 AAST-Comp eng 49
The Lagrangian trick
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are applied and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the Logistic classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so these percentages are wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
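A sketch of how the 16 instances with missing values could be dropped in Weka before the experiments (the ARFF file name is a placeholder assumption; the paper does not show its preprocessing code):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CleanDataset {
    public static void main(String[] args) throws Exception {
        // Placeholder file name for the UCI breast-cancer-wisconsin data in ARFF form
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();

        data.deleteAttributeAt(0); // drop the ID column ("Sample Code Number")
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i); // delete instances with a missing value in attribute i
        }
        // 699 - 16 = 683 instances should remain
        System.out.println("Instances after cleaning: " + data.numInstances());
    }
}
```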
04072023 AAST-Comp eng 86
Attribute                     Domain
Sample Code Number            Id number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant
0407202387
EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
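A sketch of how the three classifiers compared in the paper could be run side by side through Weka's Java API (assuming Weka 3.6.x, where BFTree ships with the core distribution; the file name, random seed, and timing approach are our assumptions, not the paper's code):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new BFTree(), new IBk(), new SMO() };
        for (Classifier model : models) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1)); // 10-fold CV
            long millis = System.currentTimeMillis() - start;

            System.out.printf("%-8s accuracy = %.2f%% (evaluated in %d ms)%n",
                    model.getClass().getSimpleName(), eval.pctCorrect(), millis);
        }
    }
}
```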
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria                BF Tree   IBK     SMO
Time to Build Model (in sec)       0.97      0.02    0.33
Correctly Classified Instances     652       655     657
Incorrectly Classified Instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
92 04072023 AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.96        0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.98    0.079   0.958       0.98     Benign
             0.921   0.02    0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
94 04072023AAST-Comp eng
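As a worked check of the definitions above against the SMO confusion matrix just shown (taking Benign as the positive class: TP = 431, FN = 13, FP = 13, TN = 226), here is a small sketch; the class name is ours:

```java
public class MetricsFromConfusionMatrix {
    public static void main(String[] args) {
        // SMO confusion matrix from the slide, Benign treated as the positive class
        int tp = 431, fn = 13, fp = 13, tn = 226;

        double sensitivity = (double) tp / (tp + fn);             // TPR = TP / (TP + FN)
        double specificity = (double) tn / (tn + fp);             // TNR = TN / (TN + FP)
        double accuracy    = (double) (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("Sensitivity = %.4f%n", sensitivity);   // ~0.9707
        System.out.printf("Specificity = %.4f%n", specificity);   // ~0.9456
        System.out.printf("Accuracy    = %.4f%n", accuracy);      // ~0.9619, i.e. the 96.19% reported
    }
}
```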
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.21         130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
04072023AAST-Comp eng96
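Rankings of this kind can be produced with Weka's attribute selection classes; the following is a sketch only (it assumes Weka 3.6.x, where ChiSquaredAttributeEval is also available, and that data already holds the cleaned 683-instance dataset with the class index set):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class RankAttributes {
    // Ranks the attributes of the supplied dataset by information gain;
    // ChiSquaredAttributeEval and GainRatioAttributeEval can be swapped in the same way.
    static void rankByInfoGain(Instances data) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        double[][] ranked = selector.rankedAttributes(); // [attributeIndex, score] pairs, best first
        for (double[] entry : ranked) {
            System.out.printf("%-28s %.3f%n",
                    data.attribute((int) entry[0]).name(), entry[1]);
        }
    }
}
```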
0407202397
CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO shows the highest performance level compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions
04072023AAST-Comp eng99
Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and builds a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.
04072023
04072023105
Thank you
AAST-Comp eng
SVM SVM
Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode
using quadratic programming techniques Using kernels can learn very complex
functions
04072023 AAST-Comp eng 51
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Classification ModelK-Nearest Neighbor
Classifier04072023 AAST-Comp eng 52
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
K-Nearest Neighbor Classifier
Learning by analogyTell me who your friends are and Irsquoll
tell you who you areA new example is assigned to the
most common class among the (K) examples that are most similar to it
04072023 AAST-Comp eng 53
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel     22    50           2           Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah     63    200          1           No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom        59    170          1           No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie     25    40           4           Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David      37    50           2           Yes        (predicted: the 3 nearest neighbors are Rachel, John and Nellie)
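For illustration, a self-contained snippet (not from the slides) that reproduces the distances and the 3-NN prediction for David from the table above:

```python
import math
from collections import Counter

# Attributes: (age, income in K, number of credit cards).
training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
ranked = sorted((dist(attrs, david), name, label) for name, attrs, label in training)
for d, name, label in ranked:
    print(f"{name}: {d:.2f} ({label})")   # Rachel 15.00, John 15.17, Nellie 15.75, Tom 122.00, Hannah 152.24

k = 3
prediction = Counter(label for _, _, label in ranked[:k]).most_common(1)[0][0]
print("Predicted response for David:", prediction)   # -> Yes
```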
Strengths and Weaknesses
Strengths: simple to implement and use; comprehensible (easy to explain a prediction); robust to noisy data by averaging over the k nearest neighbors.
Weaknesses: needs a lot of space to store all examples; takes more time to classify a new example than a prebuilt model (the distance from the new example to all stored examples must be calculated and compared).
Decision Tree
Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences written in propositional logic.
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
Payouts and Probabilities
Movie company payouts:
- Small box office: $200,000
- Medium box office: $1,000,000
- Large box office: $3,000,000
TV network payout:
- Flat rate: $900,000
Probabilities:
- P(Small Box Office) = 0.3
- P(Medium Box Office) = 0.6
- P(Large Box Office) = 0.1
Jenny Lind - Payoff Table

Decisions                 Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company   $200,000           $1,000,000          $3,000,000
Sign with TV Network      $900,000           $900,000            $900,000
Prior Probabilities       0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
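The same expected values can be checked with a few lines of Python (illustrative only, not part of the slides):

```python
# Expected monetary value of each decision, using the payoffs and prior
# probabilities from the payoff table above.
payoffs = {
    "movie":   {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "network": {"small": 900_000, "medium": 900_000,   "large": 900_000},
}
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}

for decision, outcome_payoffs in payoffs.items():
    ev = sum(probs[o] * outcome_payoffs[o] for o in probs)
    print(decision, ev)   # movie 960000.0, network 900000.0
```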
Decision Trees
Three types of "nodes":
- Decision nodes, represented by squares
- Chance nodes, represented by circles
- Terminal nodes, represented by triangles (optional)
Solving the tree involves pruning all but the best decisions at decision nodes and finding the expected values of all possible states of nature at chance nodes.
Create the tree from left to right; solve the tree from right to left.
Example Decision Tree (figure): a decision node (square) with branches Decision 1 and Decision 2, leading to a chance node (circle) with branches Event 1, Event 2 and Event 3.
Jenny Lind Decision Tree (figure): a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with Small, Medium and Large Box Office outcomes paying $200,000 / $1,000,000 / $3,000,000 for the movie deal and $900,000 for every outcome of the TV deal.

Jenny Lind Decision Tree (figure, with probabilities): the same tree annotated with the prior probabilities 0.3, 0.6 and 0.1 on the box-office branches and an ER (expected return) placeholder at each chance node.

Jenny Lind Decision Tree - Solved (figure): rolling the tree back gives ER = $960,000 for the movie contract and ER = $900,000 for the TV contract, so the decision node takes the value $960,000.
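As a sketch of "create left to right, solve right to left", the following illustrative Python solver rolls the Jenny Lind tree back from its terminal payoffs. The nested-tuple node encoding is an assumption made for this example, not notation from the slides:

```python
# Node encoding (an assumption for this sketch):
#   ("decision", [(label, subtree), ...])
#   ("chance",   [(probability, subtree), ...])
#   terminal payoffs are plain numbers.
def solve(node):
    if isinstance(node, (int, float)):           # terminal node: payoff
        return node, None
    kind, branches = node
    if kind == "chance":
        # Chance node: expected value over all states of nature.
        ev = sum(p * solve(sub)[0] for p, sub in branches)
        return ev, None
    # Decision node: prune all but the branch with the highest expected value.
    best_label, best_ev = None, float("-inf")
    for label, sub in branches:
        ev, _ = solve(sub)
        if ev > best_ev:
            best_label, best_ev = label, ev
    return best_ev, best_label

jenny = ("decision", [
    ("Sign with Movie Co.",  ("chance", [(0.3, 200_000), (0.6, 1_000_000), (0.1, 3_000_000)])),
    ("Sign with TV Network", ("chance", [(0.3, 900_000), (0.6, 900_000), (0.1, 900_000)])),
])
print(solve(jenny))   # (960000.0, 'Sign with Movie Co.')
```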
Results
(figure) The experimental cycle used in the paper: dataset, data preprocessing, feature selection, classification with the selected data mining tool, and performance evaluation.
Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn
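A small helper (illustrative, not from the paper) that turns these four counts into the metrics used later in the evaluation:

```python
def metrics(tp, fn, fp, tn):
    # Standard definitions from the confusion-matrix layout above.
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
    }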
Cross-validation
- Correctly Classified Instances: 143 (95.3%)
- Incorrectly Classified Instances: 7 (4.67%)
- Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
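Weka runs this 10-fold procedure automatically. As a rough stand-in for illustration (an assumption: the slides use Weka, and scikit-learn's bundled breast-cancer data is a different, diagnostic dataset), the same idea looks like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset bundled with scikit-learn
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print(scores.mean())   # average accuracy over the 10 held-out folds
```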
A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
- The aim of this paper is to investigate the performance of different classification techniques.
- The aim is to develop accurate prediction models for breast cancer using data mining techniques.
- Three classification techniques are compared in the Weka software and the comparison results are reported.
- Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
- Are usually not harmful
- Rarely invade the tissues around them
- Don't spread to other parts of the body
- Can be removed and usually don't grow back
Malignant tumors:
- May be a threat to life
- Can invade nearby organs and tissues (such as the chest wall)
- Can spread to other parts of the body
- Often can be removed but sometimes grow back
Risk factors
- Gender
- Age
- Genetic risk factors
- Family history
- Personal history of breast cancer
- Race (white or black)
- Dense breast tissue (denser breast tissue carries a higher risk)
- Certain benign (not cancer) breast problems
- Lobular carcinoma in situ
- Menstrual periods
Risk factors (2)
- Breast radiation early in life
- Treatment with the drug DES (diethylstilbestrol) during pregnancy
- Not having children, or having them later in life
- Certain kinds of birth control
- Using hormone therapy after menopause
- Not breastfeeding
- Alcohol
- Being overweight or obese
BACKGROUND
- Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
- Vikas Chaurasia et al. used RepTree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
- Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
BACKGROUND
- Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those who died before 5 years.
- Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
BACKGROUND
- Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
- Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
- Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
BACKGROUND
- S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the Logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
BACKGROUND
- Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
- B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
- 2 classes (malignant and benign) and 9 integer-valued attributes.
- breast-cancer-wisconsin has 699 instances.
- We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
- Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
- Note: 2 malignant and 14 benign instances were excluded, hence the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
Attribute                     Domain
Sample Code Number            id number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant
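A sketch of how this preprocessing step could be reproduced in Python with pandas. The UCI download URL and the column names are assumptions made for the example, not taken from the slides; a local copy of the file works the same way:

```python
import pandas as pd

cols = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
        "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")       # assumed location
df = pd.read_csv(url, names=cols, na_values="?")   # the missing values appear as '?'
df = df.dropna()                                    # 699 - 16 = 683 instances
print(df["class"].value_counts())                   # 2 = benign (444), 4 = malignant (239)
```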
EVALUATION METHODS
- We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
- WEKA is a collection of machine learning algorithms for data mining tasks.
- The algorithms can either be applied directly to a dataset or called from your own Java code.
- WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
- It is also well suited for developing new machine learning schemes.
- WEKA is open source software issued under the GNU General Public License.
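The paper runs SMO, IBK and BF Tree inside Weka. As a hedged illustration of the same comparison workflow only, the sketch below uses rough scikit-learn analogues (SVC for SMO, a 1-nearest-neighbour classifier for IBK, and CART for BF Tree); these are not the Weka implementations, and the exact accuracies will differ:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "SMO (SVC analogue)":               SVC(kernel="linear"),
    "IBK (1-NN analogue)":              KNeighborsClassifier(n_neighbors=1),
    "BF Tree (decision-tree analogue)": DecisionTreeClassifier(),
}

def compare(X, y, cv=10):
    # X, y: feature matrix and class vector, e.g. as produced by the loading sketch above.
    for name, clf in classifiers.items():
        print(name, cross_val_score(clf, X, y, cv=cv).mean())
```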
EXPERIMENTAL RESULTS
(shown as figures in the original slides; not reproduced in this transcript)
importance of the input variables

Number of instances taking each domain value (1-10) per attribute; each row sums to 683:

Attribute                     1     2     3     4    5    6    7    8    9    10   Sum
Clump Thickness               139   50    104   79   128  33   23   44   14   69   683
Uniformity of Cell Size       373   45    52    38   30   25   19   28   6    67   683
Uniformity of Cell Shape      346   58    53    43   32   29   30   27   7    58   683
Marginal Adhesion             393   58    58    33   23   21   13   25   4    55   683
Single Epithelial Cell Size   44    376   71    48   39   40   11   21   2    31   683
Bare Nuclei                   402   30    28    19   30   4    8    21   9    132  683
Bland Chromatin               150   160   161   39   34   9    71   28   11   20   683
Normal Nucleoli               432   36    42    18   19   22   16   23   15   60   683
Mitoses                       563   35    33    12   6    3    9    8    0    14   683
Sum                           2843  850   605   333  346  192  207  233  77   516
EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
- The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
- The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
- The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
- True positive (TP) = number of positive samples correctly predicted.
- False negative (FN) = number of positive samples wrongly predicted.
- False positive (FP) = number of negative samples wrongly predicted as positive.
- True negative (TN) = number of negative samples correctly predicted.
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
EXPERIMENTAL RESULTS (confusion matrices; rows are actual classes, columns are predicted classes)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
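As a quick sanity check (not part of the paper), the accuracies reported earlier follow directly from these confusion matrices:

```python
# Rows: actual benign / malignant; columns: predicted benign / malignant.
conf = {
    "BF Tree": ((431, 13), (18, 221)),
    "IBK":     ((435, 9),  (19, 220)),
    "SMO":     ((431, 13), (13, 226)),
}
for name, ((tp, fn), (fp, tn)) in conf.items():
    acc = (tp + tn) / (tp + fn + fp + tn)
    print(f"{name}: {acc:.2%}")   # BF Tree 95.46%, IBK 95.90%, SMO 96.19%
```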
importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.21         130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
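A sketch of how such a ranking could be produced with scikit-learn's chi-squared and mutual-information scorers, which are rough counterparts of the attribute evaluators used in Weka; this is an illustrative assumption, not the paper's exact procedure:

```python
from sklearn.feature_selection import chi2, mutual_info_classif

def rank_features(X, y, names):
    # X must contain non-negative values for the chi-squared test;
    # the Wisconsin attributes (1-10) satisfy this.
    chi_scores, _ = chi2(X, y)
    mi_scores = mutual_info_classif(X, y)
    ranked = sorted(zip(names, chi_scores, mi_scores), key=lambda r: -r[1])
    for name, c, m in ranked:
        print(f"{name:30s} chi2={c:10.2f}  MI={m:.3f}")
```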
CONCLUSION
- The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
- We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
- The performance of SMO is high compared with the other classifiers.
- The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
- Using an updated version of Weka
- Using another data mining tool
- Using alternative algorithms and techniques
Notes on paper
- Spelling mistakes
- No point of contact (e-mail)
- Wrong percentage calculation
- Copying from old papers
- Charts not clear
- No contributions
Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and makes a fusion between classifiers.
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
K-Nearest Neighbor Algorithm To determine the class of a new example
E Calculate the distance between E and all
examples in the training set Select K-nearest examples to E in the training
set Assign E to the most common class among its
K-nearest neighbors
Response
ResponseNo response
No response
No response
Class Response04072023 AAST-Comp eng 54
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Each example is represented with a set of numerical attributes
ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and
Y =(y1y2 y3hellipyn) is defined as
Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]
n
iii yxYXD
1
2)()(
JohnAge=35Income=95KNo of credit cards=3
Rachel Age=41Income=215KNo of credit cards=2
Distance Between Neighbors
04072023 AAST-Comp eng 55
Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance
must be classified
Response
Response No response
No response
No response
Class Respond04072023 AAST-Comp eng 56
Example 3-Nearest Neighbors
Customer Age
Income
No credit cards
Response
John 35 35K 3 No
Rachel 22 50K 2 Yes
Hannah 63 200K 1 No
Tom 59 170K 1 No
Nellie 25 40K 4 Yes
David 37 50K 2 04072023 AAST-Comp eng 57
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y. and Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya and K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines, Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), pp. 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey and Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H. and Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A. and Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023105
Thank you
AAST-Comp eng
Customer Age
Income (K)
No cards
John 35 35 3
Rachel 22 50 2
Hannah 63 200 1
Tom 59 170 1
Nellie 25 40 4
David 37 50 2
ResponseNo
Yes
No
No
Yes
Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes
04072023 AAST-Comp eng 58
Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest
neighbors
Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than
with a model (need to calculate and compare distance from new example to all other examples)
04072023 AAST-Comp eng 59
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
SMO showed the highest performance compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work: using an updated version of Weka, using another data mining tool, and using alternative algorithms and techniques.
AAST-Comp eng
Notes on the paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.
04072023AAST-Comp eng99
Comparison: "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and performed a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification - A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Strengths and Weaknesses
Strengths: simple to implement and use; comprehensible - easy to explain a prediction; robust to noisy data by averaging the k-nearest neighbors.
Weaknesses: needs a lot of space to store all examples; takes more time to classify a new example than with a model (the distance from the new example to all stored examples must be calculated and compared).
04072023 AAST-Comp eng 59
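A toy nearest-neighbour sketch (1-NN, the simplest case of k-NN, and not the paper's code) makes the last weakness visible: every prediction loops over all stored examples.

```java
public class NearestNeighbour {
    // Each row of 'examples' is a feature vector; 'labels' holds the class of each row.
    static int classify(double[][] examples, int[] labels, double[] query) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < examples.length; i++) {       // scan every stored example
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = examples[i][j] - query[j];
                d += diff * diff;                         // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        double[][] examples = { {1, 1}, {8, 9}, {2, 1} };
        int[] labels = { 2, 4, 2 };                       // 2 = benign, 4 = malignant
        System.out.println(classify(examples, labels, new double[] {7, 8}));  // prints 4
    }
}
```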
Decision Tree
04072023 AAST-Comp eng 60
- Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
- The decision tree can be thought of as a set of sentences written in propositional logic.
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
04072023 AAST-Comp eng 62
Payouts and Probabilities
- Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
- TV network payout:
  - Flat rate: $900,000
- Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions \ States of Nature   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company        $200,000           $1,000,000          $3,000,000
Sign with TV Network           $900,000           $900,000            $900,000
Prior Probabilities            0.3                0.6                 0.1
Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)
EV(TV)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
04072023 AAST-Comp eng 65
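The same expected values can be checked with a few lines of code; the payoffs and probabilities below are copied from the example above.

```java
public class ExpectedValue {
    static double expectedValue(double[] payoffs, double[] probabilities) {
        double ev = 0;
        for (int i = 0; i < payoffs.length; i++) {
            ev += payoffs[i] * probabilities[i];
        }
        return ev;
    }

    public static void main(String[] args) {
        double[] p = { 0.3, 0.6, 0.1 };   // small, medium, large box office
        double evMovie = expectedValue(new double[] { 200_000, 1_000_000, 3_000_000 }, p);
        double evTv    = expectedValue(new double[] { 900_000, 900_000, 900_000 }, p);
        System.out.printf("EV(movie) = $%,.0f  EV(TV) = $%,.0f%n", evMovie, evTv);
        // Prints EV(movie) = $960,000 and EV(TV) = $900,000, so the movie contract wins.
    }
}
```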
Decision Trees
- Three types of "nodes":
  - Decision nodes - represented by squares
  - Chance nodes - represented by circles (O)
  - Terminal nodes - represented by triangles (optional)
- Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
- Create the tree from left to right.
- Solve the tree from right to left.
04072023 AAST-Comp eng 66
Example Decision Tree (diagram): a decision node (square) with branches Decision 1 and Decision 2, and a chance node (circle) with branches Event 1, Event 2 and Event 3.
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree (diagram): a decision node with the choices "Sign with Movie Co." and "Sign with TV Network"; each choice leads to a chance node with Small, Medium and Large Box Office branches. The movie branch pays $200,000 / $1,000,000 / $3,000,000 and the TV branch pays $900,000 in every case.
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree (diagram, with probabilities added): the same tree with probabilities 0.3 / 0.6 / 0.1 on the Small / Medium / Large Box Office branches and an expected return (ER) to be computed at each chance node.
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved (diagram): ER at the movie chance node = $960,000 and ER at the TV chance node = $900,000, so the decision node takes the value $960,000 (sign with the movie company).
04072023 AAST-Comp eng 70
Results (diagram): the performance-evaluation cycle, with the stages dataset, data preprocessing, feature selection, data mining tool selection, classification, and performance evaluation.
Evaluation Metrics
                     Predicted as healthy   Predicted as unhealthy
Actual healthy       TP                     FN
Actual not healthy   FP                     TN
AAST-Comp eng 7204072023
Cross-validation
- Correctly Classified Instances: 143 (95.3%)
- Incorrectly Classified Instances: 7 (4.67%)
- Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
04072023 AAST-Comp eng 73
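A sketch of this 10-fold procedure written out by hand with WEKA's fold helpers (an alternative to the one-call crossValidateModel used earlier); the ARFF file name is an assumption.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");  // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));
        // data.stratify(10);  // optional: keep the class distribution similar in every fold

        int folds = 10;
        Evaluation eval = new Evaluation(data);
        for (int i = 0; i < folds; i++) {
            Instances train = data.trainCV(folds, i);   // 9 of the 10 pieces
            Instances test  = data.testCV(folds, i);    // the remaining piece
            SMO classifier = new SMO();
            classifier.buildClassifier(train);
            eval.evaluateModel(classifier, test);       // accumulate results over all folds
        }
        System.out.printf("Correctly classified: %.0f (%.2f%%)%n", eval.correct(), eval.pctCorrect());
    }
}
```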
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the performance of different classification techniques.
The goal is to develop accurate prediction models for breast cancer using data mining techniques.
Three classification techniques are compared in the Weka software and the comparison results are reported.
Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
75
0407202376
Introduction Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
- Are usually not harmful
- Rarely invade the tissues around them
- Don't spread to other parts of the body
- Can be removed and usually don't grow back
Malignant tumors:
- May be a threat to life
- Can invade nearby organs and tissues (such as the chest wall)
- Can spread to other parts of the body
- Often can be removed, but sometimes grow back
AAST-Comp eng
04072023
Risk factors
- Gender
- Age
- Genetic risk factors
- Family history
- Personal history of breast cancer
- Race: white or black
- Dense breast tissue (denser breast tissue carries a higher risk)
- Certain benign (not cancer) breast problems
- Lobular carcinoma in situ
- Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
- Breast radiation early in life
- Treatment with the drug DES (diethylstilbestrol) during pregnancy
- Not having children, or having them later in life
- Certain kinds of birth control
- Using hormone therapy after menopause
- Not breastfeeding
- Alcohol
- Being overweight or obese
78
0407202379
BACKGROUND Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
AAST-Comp eng
04072023
BACKGROUND Bellaachia et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years and the other for patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
81 AAST-Comp eng
04072023
BACKGROUND Dr S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Decision Tree
04072023 AAST-Comp eng 60
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned
ndash The decision tree can be thought of as a set sentences written propositional logic
04072023 AAST-Comp eng 61
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                 378.08158       0.464        0.152      126.23252            8
Uniformity of Cell Size         539.79308       0.702        0.3        180.265026           1
Uniformity of Cell Shape        523.07097       0.677        0.272      174.67332            2
Marginal Adhesion               390.0595        0.464        0.21       130.2445             7
Single Epithelial Cell Size     447.86118       0.534        0.233      149.542726           5
Bare Nuclei                     489.00953       0.603        0.303      163.305176           3
Bland Chromatin                 453.20971       0.555        0.201      151.32190            4
Normal Nucleoli                 416.63061       0.487        0.237      139.11820            6
Mitoses                         191.9682        0.212        0.212       64.122733           9
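A ranking like the one above can be reproduced in Weka; the sketch below is an illustration under the same file-name assumption as before, not the paper's exact procedure. It ranks the attributes by information gain, and the evaluator can be swapped for ChiSquaredAttributeEval or GainRatioAttributeEval to fill the other columns:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());  // or ChiSquaredAttributeEval / GainRatioAttributeEval
            selector.setSearch(new Ranker());
            selector.SelectAttributes(data);

            // Each row of rankedAttributes() holds {attribute index, merit score}
            for (double[] ranked : selector.rankedAttributes()) {
                System.out.printf("%-28s %.3f%n", data.attribute((int) ranked[0]).name(), ranked[1]);
            }
        }
    }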
CONCLUSION
The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
SMO showed the highest performance compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work: using an updated version of Weka, using another data mining tool, and using alternative algorithms and techniques.
Notes on the paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.
Comparison: "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea and made a fusion between classifiers.
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, in Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
Example
Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do
04072023 AAST-Comp eng 62
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Payouts and Probabilitiesbull Movie company Payouts
ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000
bull TV Network Payoutndash Flat rate - $900000
bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01
04072023 AAST-Comp eng 63
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
AAST-Comp eng 6404072023
Jenny Lind - Payoff Table
Decisions
States of Nature
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Company $200000 $1000000 $3000000
Sign with TV Network $900000 $900000 $900000
PriorProbabilities
03 06 01
Using Expected Return Criteria
EVmovie=03(200000)+06(1000000)+01(3000000)
= $960000 = EVUII or EVBest
EVtv =03(900000)+06(900000)+01(900000)
= $900000
Therefore using this criteria Jenny should select the movie contract
04072023 AAST-Comp eng 65
Decision Treesbull Three types of ldquonodesrdquo
ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)
bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes
bull Create the tree from left to right bull Solve the tree from right to left
04072023 AAST-Comp eng 66
Example Decision Tree
Decision node
Chance node
Decision 1
Decision 2
Event 1
Event 2
Event 3
04072023 AAST-Comp eng 67
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
AAST-Comp eng 101

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer (IARC) Press, 2003; 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y. and Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya and K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

04072023 AAST-Comp eng 102

[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.

04072023 AAST-Comp eng 103

[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey and Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

AAST-Comp eng 104

[13] Street, W.N., Wolberg, W.H. and Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A. and Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 185.

04072023
04072023 AAST-Comp eng 105
Thank you
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
04072023 AAST-Comp eng 68
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract
- The aim of this paper is to investigate the performance of different classification techniques.
- The goal is to develop accurate prediction models for breast cancer using data mining techniques.
- Three classification techniques are compared in the Weka software and the comparison results are reported.
- Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
- Are usually not harmful
- Rarely invade the tissues around them
- Don't spread to other parts of the body
- Can be removed and usually don't grow back
Malignant tumors:
- May be a threat to life
- Can invade nearby organs and tissues (such as the chest wall)
- Can spread to other parts of the body
- Often can be removed, but sometimes grow back
Risk factors
- Gender
- Age
- Genetic risk factors
- Family history
- Personal history of breast cancer
- Race (white or black)
- Dense breast tissue (denser tissue carries a higher risk)
- Certain benign (non-cancerous) breast problems
- Lobular carcinoma in situ
- Menstrual periods
Risk factors (continued)
- Breast radiation early in life
- Treatment with the drug DES (diethylstilbestrol) during pregnancy
- Not having children, or having them later in life
- Certain kinds of birth control
- Using hormone therapy after menopause
- Not breastfeeding
- Alcohol
- Being overweight or obese
BACKGROUND
- Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
- Vikas Chaurasia et al. used REP Tree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
- Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
BACKGROUND (continued)
- Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
- Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict survivability for heart disease patients.
BACKGROUND (continued)
- Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict survivability for heart disease patients.
- Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
- Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
BACKGROUND (continued)
- Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were applied and tested in that work; the performance factors used to analyse the efficiency of the algorithms were clustering accuracy and error rate. The results show that the Logistic classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.
BACKGROUND (continued)
- Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
- B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Source: the UC Irvine Machine Learning Repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
- 2 classes (malignant and benign) and 9 integer-valued attributes.
- The breast-cancer-Wisconsin dataset has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
- Class distribution (original 699 instances): benign 458 (65.5%), malignant 241 (34.5%).
- Note: after excluding the 2 malignant and 14 benign instances with missing values, those percentages no longer hold; the correct distribution is benign 444 (65%) and malignant 239 (35%).
Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant
EVALUATION METHODS
- We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
- WEKA is a collection of machine learning algorithms for data mining tasks.
- The algorithms can either be applied directly to a dataset or called from your own Java code (see the sketch below).
- WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
- It is also well suited for developing new machine learning schemes.
- WEKA is open source software issued under the GNU General Public License.
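As a sketch of how the three classifiers compared in the paper could be driven from Java (this is not the authors' code; the ARFF file name is a placeholder, IBk is used with its default k, and BFTree is the best-first tree shipped with Weka 3.6.x):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet(); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Drop instances with any missing attribute value (the 16 instances removed in the paper).
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }

        Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
        for (Classifier c : classifiers) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            long elapsed = System.currentTimeMillis() - start;

            System.out.printf("%s: accuracy = %.2f%%, evaluation time = %d ms%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(), elapsed);
            System.out.println(eval.toMatrixString());             // confusion matrix
        }
    }
}
```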
EXPERIMENTAL RESULTS (slides 88-89)
[Result screenshots/charts; no text content on these slides]
Importance of the input variables: number of instances taking each value (1-10) of each attribute

Attribute                      1     2     3     4     5     6     7     8     9    10   Sum
Clump Thickness              139    50   104    79   128    33    23    44    14    69   683
Uniformity of Cell Size      373    45    52    38    30    25    19    28     6    67   683
Uniformity of Cell Shape     346    58    53    43    32    29    30    27     7    58   683
Marginal Adhesion            393    58    58    33    23    21    13    25     4    55   683
Single Epithelial Cell Size   44   376    71    48    39    40    11    21     2    31   683
Bare Nuclei                  402    30    28    19    30     4     8    21     9   132   683
Bland Chromatin              150   160   161    39    34     9    71    28    11    20   683
Normal Nucleoli              432    36    42    18    19    22    16    23    15    60   683
Mitoses                      563    35    33    12     6     3     9     8     0    14   683
Sum                         2843   850   605   333   346   192   207   233    77   516
EXPERIMENTAL RESULTS

Evaluation criteria                  BF Tree    IBK     SMO
Time to build model (sec)              0.97     0.02    0.33
Correctly classified instances          652      655     657
Incorrectly classified instances         31       28      26
Accuracy (%)                           95.46    95.90   96.19
EXPERIMENTAL RESULTS
- The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
- The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
- The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
- True positive (TP) = number of positive samples correctly predicted.
- False negative (FN) = number of positive samples wrongly predicted.
- False positive (FP) = number of negative samples wrongly predicted as positive.
- True negative (TN) = number of negative samples correctly predicted.
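To make the definitions concrete, here is a small sketch that plugs in SMO's confusion-matrix counts from the results that follow, under the assumption that "malignant" is treated as the positive class:

```java
// Sensitivity, specificity, and accuracy from raw confusion-matrix counts.
public class ConfusionMetrics {
    public static void main(String[] args) {
        // SMO counts from the confusion matrix below, with "malignant" as the positive class (assumption).
        double tp = 226;  // malignant correctly predicted as malignant
        double fn = 13;   // malignant wrongly predicted as benign
        double fp = 13;   // benign wrongly predicted as malignant
        double tn = 431;  // benign correctly predicted as benign

        double sensitivity = tp / (tp + fn);                   // true positive rate
        double specificity = tn / (tn + fp);                   // true negative rate
        double accuracy    = (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("sensitivity = %.4f%n", sensitivity); // ~0.9456
        System.out.printf("specificity = %.4f%n", specificity); // ~0.9707
        System.out.printf("accuracy    = %.4f%n", accuracy);    // ~0.9619 (96.19%)
    }
}
```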
EXPERIMENTAL RESULTS

Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree       0.971     0.075      0.960      0.971   Benign
              0.925     0.029      0.944      0.925   Malignant
IBK           0.980     0.079      0.958      0.980   Benign
              0.921     0.020      0.961      0.921   Malignant
SMO           0.971     0.054      0.971      0.971   Benign
              0.946     0.029      0.946      0.946   Malignant
EXPERIMENTAL RESULTS (confusion matrices: rows are the actual class, columns the predicted class)

Classifier   Predicted Benign   Predicted Malignant   Actual class
BF Tree            431                  13            Benign
                    18                 221            Malignant
IBK                435                   9            Benign
                    19                 220            Malignant
SMO                431                  13            Benign
                    13                 226            Malignant
Importance of the input variables (attribute ranking)

Variable                      Chi-squared   Info Gain   Gain Ratio   Average score   Importance (rank)
Clump Thickness                 378.08158      0.464       0.152       126.232526            8
Uniformity of Cell Size         539.79308      0.702       0.300       180.265026            1
Uniformity of Cell Shape        523.07097      0.677       0.272       174.673323            2
Marginal Adhesion               390.05950      0.464       0.210       130.244500            7
Single Epithelial Cell Size     447.86118      0.534       0.233       149.542726            5
Bare Nuclei                     489.00953      0.603       0.303       163.305176            3
Bland Chromatin                 453.20971      0.555       0.201       151.321903            4
Normal Nucleoli                 416.63061      0.487       0.237       139.118203            6
Mitoses                         191.96820      0.212       0.212        64.122733            9
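A hedged sketch of how a ranking like this could be reproduced with Weka's built-in attribute evaluators (chi-squared, information gain, and gain ratio with a Ranker search); the ARFF file name is a placeholder, and this is not necessarily how the paper's numbers were produced:

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet(); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        ASEvaluation[] evaluators = {
            new ChiSquaredAttributeEval(),   // chi-squared statistic with respect to the class
            new InfoGainAttributeEval(),     // information gain
            new GainRatioAttributeEval()     // gain ratio
        };

        for (ASEvaluation evaluator : evaluators) {
            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(evaluator);
            selection.setSearch(new Ranker());   // rank all attributes by the evaluator's score
            selection.SelectAttributes(data);
            System.out.println(evaluator.getClass().getSimpleName());
            System.out.println(selection.toResultsString());
        }
    }
}
```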
CONCLUSION
- The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
- We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
- SMO shows the highest performance compared with the other classifiers.
- The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
- Use an updated version of Weka
- Use another data mining tool
- Use alternative algorithms and techniques
Notes on paper
- Spelling mistakes
- No point of contact (e-mail)
- Wrong percentage calculation
- Copying from old papers
- Charts not clear
- No contributions
Comparison
- "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.
- That paper introduced a more advanced idea and fused multiple classifiers.
References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Jenny Lind Decision Tree
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER
ER
ER
04072023 AAST-Comp eng 69
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Jenny Lind Decision Tree - Solved
Small Box Office
Medium Box Office
Large Box Office
Small Box Office
Medium Box Office
Large Box Office
Sign with Movie Co
Sign with TV Network
$200000
$1000000
$3000000
$900000
$900000
$900000
3
6
1
3
6
1
ER900000
ER960000
ER960000
04072023 AAST-Comp eng 70
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Results
Data preprocessing
Feature selection
Classification
Selection tool data mining
Performance evaluation Cycle
Dataset
Evaluation Metrics
Predicted as healthy Predicted as unhealthy
Actual healthy tp fn
Actual not healthy fp tn
AAST-Comp eng 7204072023
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION
The accuracy of the classification techniques was evaluated for each selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree. SMO shows the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions
Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012: that paper introduced a more advanced idea and made a fusion between classifiers.
References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
Cross-validation
bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie
ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average
04072023 AAST-Comp eng 73
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023AAST-Comp eng74
A Novel Approach for Breast Cancer Detection using Data Mining Techniques
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023AAST-Comp eng
Abstract The aim of this paper is to investigate the
performance of different classification techniques
Aim is developing accurate prediction models for breast cancer using data mining techniques
Comparing three classification techniques in Weka software and comparison results
Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods
75
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND
Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
The breast-cancer-Wisconsin dataset, with 699 instances, was obtained from the UC Irvine Machine Learning Repository. The data come from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg. It has 2 classes (malignant and benign) and 9 integer-valued attributes.
We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
Note: since 2 malignant and 14 benign instances were excluded, these percentages no longer hold for the reduced dataset; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
04072023AAST-Comp eng85
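To make this preprocessing step concrete, the following is a minimal sketch (not the authors' code) of dropping the instances with missing values through the Weka Java API; the file name breast-cancer-wisconsin.arff and the use of the last attribute as the class are assumptions.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PrepareDataset {
        public static void main(String[] args) throws Exception {
            // Load the original 699-instance dataset (file name assumed)
            Instances data = DataSource.read("breast-cancer-wisconsin.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Delete every instance that has at least one missing attribute value
            for (int i = data.numInstances() - 1; i >= 0; i--) {
                if (data.instance(i).hasMissingValue()) {
                    data.delete(i);
                }
            }
            System.out.println("Instances after removal: " + data.numInstances()); // expected: 683
        }
    }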
04072023 AAST-Comp eng 86
Attribute: Domain
Sample Code Number: ID number
Clump Thickness: 1 - 10
Uniformity of Cell Size: 1 - 10
Uniformity of Cell Shape: 1 - 10
Marginal Adhesion: 1 - 10
Single Epithelial Cell Size: 1 - 10
Bare Nuclei: 1 - 10
Bland Chromatin: 1 - 10
Normal Nucleoli: 1 - 10
Mitoses: 1 - 10
Class: 2 for benign, 4 for malignant
0407202387
EVALUATION METHODS
We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
It is also well suited for developing new machine learning schemes.
WEKA is open source software issued under the GNU General Public License.
AAST-Comp eng
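As an illustration of this setup, the sketch below runs the three classifiers compared in this work with 10-fold cross-validation through the Weka Java API. It assumes Weka 3.6.x (where BFTree ships with the core distribution), default classifier parameters, and a prepared 683-instance ARFF file named breast-cancer-wisconsin-683.arff.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.BFTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast-cancer-wisconsin-683.arff"); // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
            for (Classifier c : classifiers) {
                Evaluation eval = new Evaluation(data);
                // 10-fold cross-validation, the usual way such accuracy figures are obtained in Weka
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.println(c.getClass().getSimpleName()
                        + " accuracy: " + eval.pctCorrect() + " %");
            }
        }
    }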
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Attribute \ Domain value: 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria (per classifier: BF Tree / IBK / SMO)
Time to build model (in sec): 0.97 / 0.02 / 0.33
Correctly classified instances: 652 / 655 / 657
Incorrectly classified instances: 31 / 28 / 26
Accuracy (%): 95.46 / 95.90 / 96.19
EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
92 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Classifier: TP Rate / FP Rate / Precision / Recall, per class
BF Tree: 0.971 / 0.075 / 0.960 / 0.971 (Benign); 0.925 / 0.029 / 0.944 / 0.925 (Malignant)
IBK: 0.980 / 0.079 / 0.958 / 0.980 (Benign); 0.921 / 0.020 / 0.961 / 0.921 (Malignant)
SMO: 0.971 / 0.054 / 0.971 / 0.971 (Benign); 0.946 / 0.029 / 0.946 / 0.946 (Malignant)
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
Confusion matrices (rows: actual class; columns: predicted Benign / Malignant)
BF Tree: Benign 431 / 13; Malignant 18 / 221
IBK: Benign 435 / 9; Malignant 19 / 220
SMO: Benign 431 / 13; Malignant 13 / 226
94 04072023AAST-Comp eng
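As a quick worked check of the definitions above against the SMO confusion matrix, taking the malignant class as positive: TP = 226, FN = 13, FP = 13, TN = 431, so
sensitivity = 226 / (226 + 13) ≈ 0.946
specificity = 431 / (431 + 13) ≈ 0.971
accuracy = (226 + 431) / 683 = 657 / 683 ≈ 0.9619
which agrees with the 0.946 malignant recall, 0.971 benign recall and 96.19% overall accuracy reported for SMO.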
importance of the input variables
04072023AAST-Comp eng95
Variable: Chi-squared / Info Gain / Gain Ratio / Average / Importance rank
Clump Thickness: 378.08158 / 0.464 / 0.152 / 126.232526 / 8
Uniformity of Cell Size: 539.79308 / 0.702 / 0.3 / 180.265026 / 1
Uniformity of Cell Shape: 523.07097 / 0.677 / 0.272 / 174.673323 / 2
Marginal Adhesion: 390.0595 / 0.464 / 0.21 / 130.2445 / 7
Single Epithelial Cell Size: 447.86118 / 0.534 / 0.233 / 149.542726 / 5
Bare Nuclei: 489.00953 / 0.603 / 0.303 / 163.305176 / 3
Bland Chromatin: 453.20971 / 0.555 / 0.201 / 151.321903 / 4
Normal Nucleoli: 416.63061 / 0.487 / 0.237 / 139.118203 / 6
Mitoses: 191.9682 / 0.212 / 0.212 / 64.122733 / 9
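A ranking of this kind can be reproduced, at least as a sketch, with Weka's attribute evaluators; the mapping of the three score columns to the ChiSquaredAttributeEval, InfoGainAttributeEval and GainRatioAttributeEval classes is an assumption, as is the input file name.

    import weka.attributeSelection.ASEvaluation;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.ChiSquaredAttributeEval;
    import weka.attributeSelection.GainRatioAttributeEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast-cancer-wisconsin-683.arff"); // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            ASEvaluation[] evaluators = {
                new ChiSquaredAttributeEval(), new InfoGainAttributeEval(), new GainRatioAttributeEval()
            };
            for (ASEvaluation evaluator : evaluators) {
                AttributeSelection selector = new AttributeSelection();
                selector.setEvaluator(evaluator);
                selector.setSearch(new Ranker());            // rank all attributes by their score
                selector.SelectAttributes(data);
                double[][] ranked = selector.rankedAttributes(); // rows of [attributeIndex, score]
                System.out.println(evaluator.getClass().getSimpleName());
                for (double[] entry : ranked) {
                    System.out.println("  " + data.attribute((int) entry[0]).name() + " = " + entry[1]);
                }
            }
        }
    }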
04072023AAST-Comp eng96
0407202397
CONCLUSION
The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBk and BF Tree.
The performance of SMO is the highest compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.
AAST-Comp eng
0407202398
Future work: using an updated version of Weka; using another data mining tool; using alternative algorithms and techniques.
AAST-Comp eng
Notes on paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.
04072023AAST-Comp eng99
Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and made a fusion between classifiers.
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group, United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report, Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC, World Cancer Report, International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011), Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011), An Empirical Comparison of Data Mining Classification Methods, International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
04072023
AAST-Comp eng 102
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets, 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
04072023
[9] T. Joachims, Transductive inference for text classification using support vector machines, Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD, Computerized breast cancer diagnosis and prognosis from fine needle aspirates, Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street W.N., Wolberg W.H., Mangasarian O.L., Nuclear feature extraction for breast tumor diagnosis, Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), Feature Selection and Classification using Flexible Neural Tree, Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 185.
04072023
04072023105
Thank you
AAST-Comp eng
0407202376
Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes
such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest
wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back
AAST-Comp eng
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023
Risk factors
Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have
a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods
77 AAST-Comp eng
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023AAST-Comp eng
Risk factors
Breast radiation early in life Treatment with DES the drug DES
(diethylstilbestrol) during pregnancy Not having children or having them later in
life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese
78
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
0407202379
BACKGROUND Bittern et al used artificial neural network to
predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation
Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients
Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability
AAST-Comp eng
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND (cont.)
Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
The dataset comes from the UC Irvine Machine Learning Repository; the data were collected at the University of Wisconsin Hospital, Madison, by Dr. W. H. Wolberg.
It has 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin contains 699 instances.
We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
Note: because 2 malignant and 14 benign instances were excluded, those percentages no longer apply; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
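As a rough sketch of this preprocessing step (assuming a local ARFF copy of the data named breast-cancer-wisconsin.arff with the class as the last attribute), the instances with missing values can be dropped through Weka's Java API as follows; in this dataset only the Bare Nuclei attribute has missing values.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DropMissing {
    public static void main(String[] args) throws Exception {
        // Hypothetical local ARFF copy of the breast-cancer-wisconsin data.
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Before: " + data.numInstances() + " instances"); // 699

        // Remove every instance that has a missing value in any attribute
        // (16 instances, all with a missing Bare Nuclei value).
        for (int a = 0; a < data.numAttributes(); a++) {
            data.deleteWithMissing(a);
        }
        System.out.println("After:  " + data.numInstances() + " instances"); // 683
    }
}
```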
Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity Of Cell Size       1 - 10
Uniformity Of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant
EVALUATION METHODS
We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
It is also well suited for developing new machine learning schemes.
WEKA is open source software issued under the GNU General Public License.
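To illustrate the "called from your own Java code" route, here is a minimal sketch that cross-validates the three classifiers compared later in the results (BF Tree, IBk, SMO). The ARFF file name is an assumption, and class names are taken from the Weka 3.6 API, where BFTree still ships in weka.classifiers.trees; the timing printed here covers the whole 10-fold evaluation rather than a single model build.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Hypothetical local ARFF copy of the breast-cancer-wisconsin data.
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new BFTree(), new IBk(), new SMO() };
        for (Classifier c : classifiers) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold CV
            long elapsed = System.currentTimeMillis() - start;

            System.out.printf("%-10s accuracy %.2f%%  (%d ms)%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(), elapsed);
        }
    }
}
```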
EXPERIMENTAL RESULTS
importance of the input variables

Attribute (counts per domain value)   1     2     3     4     5     6     7     8     9    10    Sum
Clump Thickness                      139    50   104    79   128    33    23    44    14    69    683
Uniformity of Cell Size              373    45    52    38    30    25    19    28     6    67    683
Uniformity of Cell Shape             346    58    53    43    32    29    30    27     7    58    683
Marginal Adhesion                    393    58    58    33    23    21    13    25     4    55    683
Single Epithelial Cell Size           44   376    71    48    39    40    11    21     2    31    683
Bare Nuclei                          402    30    28    19    30     4     8    21     9   132    683
Bland Chromatin                      150   160   161    39    34     9    71    28    11    20    683
Normal Nucleoli                      432    36    42    18    19    22    16    23    15    60    683
Mitoses                              563    35    33    12     6     3     9     8     0    14    683
Sum                                 2843   850   605   333   346   192   207   233    77   516
EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBk     SMO
Time to Build Model (in sec)       0.97      0.02    0.33
Correctly Classified Instances     652       655     657
Incorrectly Classified Instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19
EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
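To make the definitions concrete, the short plain-Java sketch below plugs in the BF Tree confusion matrix reported two slides later (treating Benign as the positive class) and reproduces the 95.46% accuracy figure.

```java
public class ConfusionMetrics {
    public static void main(String[] args) {
        // BF Tree confusion matrix from the results slide (Benign = positive class):
        // 431 benign correctly classified, 13 benign misclassified as malignant,
        // 18 malignant misclassified as benign, 221 malignant correctly classified.
        double tp = 431, fn = 13, fp = 18, tn = 221;

        double sensitivity = tp / (tp + fn);                // TPR = TP / (TP + FN)
        double specificity = tn / (tn + fp);                // TNR = TN / (TN + FP)
        double precision   = tp / (tp + fp);                // TP / (TP + FP)
        double accuracy    = (tp + tn) / (tp + fp + tn + fn);

        System.out.printf("Sensitivity: %.4f%n", sensitivity); // ~0.971
        System.out.printf("Specificity: %.4f%n", specificity); // ~0.925
        System.out.printf("Precision:   %.4f%n", precision);   // ~0.960
        System.out.printf("Accuracy:    %.4f%n", accuracy);    // ~0.9546 (95.46%)
    }
}
```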
EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBk          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant
EXPERIMENTAL RESULTS (confusion matrices: columns give the number of instances predicted as Benign / Malignant, the Class column gives the actual class)

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBk          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance Rank
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.210        130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9
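A ranking like this can be reproduced with Weka's attribute selection API. The sketch below (file name assumed, class names from the Weka 3.6 line, where ChiSquaredAttributeEval is still available) ranks the attributes by information gain; the same pattern applies to the chi-squared and gain ratio evaluators before averaging the three scores.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        // Hypothetical local ARFF copy of the breast-cancer-wisconsin data.
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes by information gain with respect to the class;
        // swap in ChiSquaredAttributeEval or GainRatioAttributeEval as needed.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        for (double[] rank : selector.rankedAttributes()) {
            int attrIndex = (int) rank[0];
            System.out.printf("%-28s %.3f%n", data.attribute(attrIndex).name(), rank[1]);
        }
    }
}
```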
CONCLUSION
The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBk, and BF Tree.
SMO shows the highest performance compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.
Future work
Using an updated version of WEKA.
Using another data mining tool.
Using alternative algorithms and techniques.
Notes on paper
Spelling mistakes.
No point of contact (e-mail).
Wrong percentage calculation.
Copying from old papers.
Charts not clear.
No contributions.
Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and made a fusion between classifiers.
References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Thank you
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023
BACKGROUND Bellaachi et al used naive bayes decision tree
and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years
Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients
80 AAST-Comp eng
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023
BACKGROUND Vikas Chaurasia et al used CART (Classification and
Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients
Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45
Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry
81 AAST-Comp eng
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023
BACKGROUND Dr SVijayarani et al analyses the
performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization
82 AAST-Comp eng
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023AAST-Comp eng
BACKGROUND Kaewchinporn Clsquos presented a new classification
algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain
BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification
83
04072023AAST-Comp eng84
BREAST-CANCER-WISCONSIN DATA SET
SUMMARY
BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison
collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued
attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from
the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241
(345) Note 2 malignant and 14 benign excluded hence
percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)
04072023AAST-Comp eng85
04072023 AAST-Comp eng 86
Attribute DomainSample Code Number Id Number
Clump Thickness 1 - 10
Uniformity Of Cell Size 1 - 10
Uniformity Of Cell Shape 1 - 10
Marginal Adhesion 1 - 10
Single Epithelial Cell Size 1 - 10
Bare Nuclei 1 - 10
Bland Chromatin 1 - 10
Normal Nucleoli 1 - 10
Mitoses 1 - 10
Class 2 For Benign 4 For Malignant
0407202387
EVALUATION METHODS We have used the Weka (Waikato Environment for
Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms
for data mining tasks The algorithms can either be applied directly to a
dataset or called from your own Java code WEKA contains tools for data preprocessing
classification regression clustering association rules visualization and feature selection
It is also well suited for developing new machine learning schemes
WEKA is open source software issued under the GNU General Public License
AAST-Comp eng
EXPERIMENTAL RESULTS
88 04072023AAST-Comp eng
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive & descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification: A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers: Which Hyperplane?
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
EXPERIMENTAL RESULTS
89 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. International Conference on Advances in Recent Technologies in Communication and Computing, 2009.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
04072023105
Thank you
AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng90
Domain 1 2 3 4 5 6 7 8 9 10 Sum
Clump Thickness 139 50 104 79 128 33 23 44 14 69 683
Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683
Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683
Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683
Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683
Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683
Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683
Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683
Mitoses 563 35 33 12 6 3 9 8 0 14 683
Sum 2843 850 605 333 346 192 207 233 77 516
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
EXPERIMENTAL RESULTS
91 04072023AAST-Comp eng
Evaluation Criteria
Classifiers
BF TREE IBK SMO
Timing To Build Model (In Sec)
097 002 033
Correctly Classified Instances
652 655 657
Incorrectly Classified Instances
31 28 26
Accuracy () 9546 9590 9619
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined
by TP (TP + FN) the specificity or the true negative rate (TNR) is
defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +
FN) True positive (TP) = number of positive samples
correctly predicted False negative (FN) = number of positive samples
wrongly predicted False positive (FP) = number of negative samples
wrongly predicted as positive True negative (TN) = number of negative samples
correctly predicted92 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
EXPERIMENTAL RESULTSClassifi
erTP FP Precisio
nRecall Class
BF Tree
0971 0075 096 0971 Benign
0925 0029 0944 0925 Malignant
IBK
098 0079 0958 098 Benign
0921 002 0961 0921 Malignant
SMO
0971 0054 0971 0971 Benign
0946 0029 0946 0946 Malignant
93 04072023AAST-Comp eng
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
EXPERIMENTAL RESULTSClassifier Benign Malignant Class
BF Tree
431 13 Benign
18 221 Malignant
IBK
435 9 Benign
19 220 Malignant
SMO
431 13 Benign
13 226 Malignant
94 04072023AAST-Comp eng
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
importance of the input variables
04072023AAST-Comp eng95
variable Chi-squared
Info Gain
Gain Ratio
Average Rank IMPORTANCE
Clump Thickness 37808158 0464 0152 12623252
6 8Uniformity of
Cell Size
53979308 0702 03 180265026 1
Uniformity of Cell Shape 52307097 0677 0272
17467332
32
Marginal Adhesion 3900595 0464 021 1302445 7
Single Epithelial Cell Size
44786118 0534 0233
149542726
5
Bare Nuclei 48900953 0603 0303
163305176
3
Bland Chromatin 45320971 0555 0201
15132190
34
Normal Nucleoli 41663061 0487 0237
13911820
36
Mitoses 1919682 0212 0212 64122733 9
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancerrsquos Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive amp descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- ClassificationmdashA Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM ndash Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-
04072023AAST-Comp eng96
0407202397
CONCLUSION the accuracy of classification techniques is
evaluated based on the selected classifier algorithm
we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree
The performance of SMO shows the high level compare with other classifiers
most important attributes for breast cancer survivals are Uniformity of Cell Size
AAST-Comp eng
0407202398
Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques
AAST-Comp eng
Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions
04072023AAST-Comp eng99
comparison Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers written International Journal of Computer and
Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012
Paper introduced more advanced idea and make a fusion between classifiers
04072023AAST-Comp eng100
References
101AAST-Comp eng
[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)
[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023
AAST-Comp eng 102
[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996
04072023
[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994
AAST-Comp eng 10304072023
AAST-Comp eng 104
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3), pp. 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.
04072023
04072023105
Thank you
AAST-Comp eng
- A Novel Approach for Breast Cancer Detection using Data Mining
- AGENDA
- AGENDA (Cont)
- What Is Cancer
- Slide 5
- Breast Cancer
- History and Background
- Breast Cancer Classification
- Breast cancer's Features
- Diagnosis or prognosis
- Computer-Aided Diagnosis
- Computational Intelligence
- What do these methods do
- Pattern recognition system decomposition
- Slide 15
- data sets
- Slide 17
- Data Mining
- Predictive & descriptive data mining
- Data Mining Models and Tasks
- Data mining Tools
- weka
- Slide 23
- Data Preprocessing
- Preprocessing techniques
- Slide 26
- Slide 27
- Feature Selection
- Slide 29
- Supervised Learning
- Classification
- Classification vs Prediction
- Classification: A Two-Step Process
- Classification Process (1) Model Construction
- Classification Process (2) Use the Model in Prediction
- Classification (2)
- Classification
- Quality of a classifier
- Classification Techniques
- Classification Techniques (2)
- Classification Model
- Support Vector Machine (SVM)
- Support Vector Machine (SVM) (2)
- Tennis example
- Linear classifiers Which Hyperplane
- Selection of a Good Hyper-Plane
- SVM - Support Vector Machines
- Support Vector Machine (SVM) (3)
- Non-Separable Case
- SVM
- Classification Model (2)
- Slide 53
- K-Nearest Neighbor Algorithm
- Slide 55
- Instance Based Learning
- Example 3-Nearest Neighbors
- Slide 58
- Slide 59
- Decision Tree
- Slide 61
- Example
- Payouts and Probabilities
- Jenny Lind - Payoff Table
- Using Expected Return Criteria
- Decision Trees
- Example Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree
- Jenny Lind Decision Tree - Solved
- Slide 71
- Evaluation Metrics
- Cross-validation
- A Novel Approach for Breast Cancer Detection using Data Mining
- Abstract
- Introduction
- Risk factors
- Risk factors (2)
- BACKGROUND
- BACKGROUND (2)
- BACKGROUND (3)
- BACKGROUND (4)
- BACKGROUND (5)
- Slide 84
- BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- Slide 86
- EVALUATION METHODS
- EXPERIMENTAL RESULTS
- EXPERIMENTAL RESULTS (2)
- importance of the input variables
- EXPERIMENTAL RESULTS (3)
- EXPERIMENTAL RESULTS (4)
- EXPERIMENTAL RESULTS (5)
- EXPERIMENTAL RESULTS (6)
- importance of the input variables (2)
- Slide 96
- CONCLUSION
- Future work
- Notes on paper
- comparison
- References
- Slide 102
- Slide 103
- Slide 104
- Slide 105
-