Classification of Chestnuts with Feature Selection by Noise-Resilient Classifiers
TRANSCRIPT
Classification of chestnuts with feature selection by noise resilient classifiers
Elena Roglia
Rossella Cancelliere
Rosa Meo
University of Turin – Department of Computer Science – Italy
Classifying chestnut plants according to their place of origin.
Which features?
Which classifiers?
What to do with noise?
Prediction of chestnut origin from their properties is important!!
Industrial applications, for example: verification of certificates of product origin.
Papaya from Italy?!! Think before you eat!!
Why feature selection?
Botanic features are extracted, collected, and stored in a data set by human agents.
The process is:
- lengthy
- costly
- error prone
Decision tree
Its nodes perform a test on a data attribute: the outcome partitions the training set into smaller partitions until the class value becomes homogeneous. The class value is the prediction of the decision tree for the set of data records that reach that final node.
Tree induction by entropy
The C4.5 algorithm induces the form of the decision tree using the entropy of the class value. It grows the tree until the entropy reduction falls under a user-defined threshold.
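The entropy-reduction criterion above can be sketched in a few lines. This is a minimal illustration, not the C4.5 implementation itself; the function names and toy labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy reduction achieved by splitting `labels` into `partitions`.

    `partitions` is the list of label subsets produced by a test on one
    attribute. C4.5 keeps growing the tree only while this reduction
    stays above a user-defined threshold.
    """
    n = len(labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - weighted

# A pure split removes all uncertainty about the class:
labels = ["A", "A", "B", "B"]
print(information_gain(labels, [["A", "A"], ["B", "B"]]))  # 1.0
```

A split that leaves each partition class-homogeneous (entropy 0) yields the maximum reduction, which is why such nodes become leaves.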
Random forest
A random forest is a special ensemble learner. A large number of decision trees is grown; each tree depends on the values of a random vector. Predictions are usually combined using the technique of majority voting, so that the most popular class among the trees is predicted.
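The two ingredients named above, per-tree random resampling and majority voting, can be sketched as follows. This is a toy illustration under the stated assumptions (the `tree` models are stand-ins, not trained classifiers):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Training set for one tree: random selection with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the per-tree predictions: the most popular class wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))                     # 10 (same size, duplicates allowed)
print(majority_vote(["A", "B", "A"]))  # A
```

In the full algorithm, one decision tree would be trained on each bootstrap sample and `majority_vote` applied to their predictions for every test record.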
Which classifiers?
Initial data set: 1600 samples, 19 features from fruit peculiarities.

MLP    58.12%
RBF    47.97%
C4.5   49.81%
RF     55.06%
SMO    52.50%
NEW FEATURES?
New data set: 1600 samples, 37 features from fruit and plant peculiarities.
Feature selection methods: Symmetrical Uncertainty, Chi-Square Statistic, Gain Ratio, Information Gain, OneR methodology...
FINAL DATA SET
1600 samples, 6 features selected by the entropy-based information gain criterion:
- # of chestnuts/kg
- Diameter of the trunk
- # of female inflorescences/ament
- Ament length
- Length of the leaf limb
- Height of the plant
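The information-gain ranking used to pick those 6 features can be sketched as below. The toy feature names and values are illustrative assumptions; the real features are continuous and would be discretized first.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the class."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_features(columns, labels, k):
    """Rank features by information gain and keep the top k."""
    ranked = sorted(columns,
                    key=lambda name: info_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]

# Toy data: `trunk_diameter` separates the classes, `noise` does not.
labels = ["north", "north", "south", "south"]
columns = {"trunk_diameter": ["small", "small", "large", "large"],
           "noise": ["x", "y", "x", "y"]}
print(select_features(columns, labels, 1))  # ['trunk_diameter']
```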
Some details...
Random forest: a forest of trees built with all 6 features, each trained on a different training set built by random selection of samples with replacement.
Training set (70%): 1120 samples
Test set (30%): 480 samples
Target classes: 8
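The 70/30 experimental split and the accuracy measure can be sketched as follows; the helper names are assumptions, and the split here is a plain shuffle (the slides don't say whether it was stratified by class).

```python
import random

def train_test_split(samples, seed=0):
    """Shuffle and split 70% / 30% (1600 samples -> 1120 train / 480 test)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # 70% cut, integer arithmetic
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Fraction of (record, label) pairs the classifier predicts correctly."""
    hits = sum(classifier(x) == y for x, y in test_set)
    return hits / len(test_set)

train, test = train_test_split(list(range(1600)))
print(len(train), len(test))  # 1120 480
```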
MLP: 6, 12 and 8 neurons respectively in the input, hidden and output layers (one output neuron for each geographic zone). Training phase of 100 iterations.
Decision tree: used the default settings of the Weka tool; obtained a binary tree of 15 leaves and 6 levels.
ACCURACIES ON TEST SET
MLP 97.91%
C4.5 100%
RF 100%
ARE THESE CLASSIFIERS ROBUST? WHAT IS THEIR PERFORMANCE ON A NOISY DATA SET?

Noise is injected by perturbing each attribute value:

i'(A) = i(A) ± 0.05 · i(A)

where i(A) is the value of attribute A in the i-th instance.
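The perturbation i'(A) = i(A) ± 0.05 · i(A) can be sketched as below. The exact noise distribution isn't stated in the transcript; drawing the factor uniformly from [-0.05, +0.05] is an assumption, as is the function name.

```python
import random

def add_noise(value, level=0.05, rng=random):
    """Perturb an attribute value by up to +/-5% of itself:
    i'(A) = i(A) +/- 0.05 * i(A), sign and magnitude drawn at random
    (uniform distribution assumed)."""
    return value + rng.uniform(-level, level) * value

rng = random.Random(42)
noisy = add_noise(100.0, rng=rng)
print(95.0 <= noisy <= 105.0)  # True
```

Applying this to every attribute of every test record yields the noisy test set used in the table below.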
ACCURACIES ON TEST SET

       WITHOUT NOISE   WITH NOISE
MLP    97.91%          93.12%
C4.5   100%            82.29%
RF     100%            90.62%
[Plot of accuracy under noise: + decision tree, ∆ random forest, * multilayer perceptron. The MLP is QUITE STABLE!! Its class prediction is marginally affected by noise. The trees are MORE SENSITIVE!!]
Conclusions
The results, in the context of this peculiar domain, confirm the robustness of neural network classification techniques and their reliability in treating noisy data. Even though decision trees and random forests reach higher accuracy rates on clean test data, when noise is present they turn out to be less robust and stable.
FURTHER WORK COMING SOON!!!!