Classification of Chestnuts with Feature Selection by Noise-Resilient Classifiers
TRANSCRIPT
Classification of chestnuts with feature selection by noise resilient classifiers
Elena Roglia
Rossella Cancelliere
Rosa Meo
University of Turin – Department of Computer Science – Italy
Classifying chestnut plants according to their place of origin.
Which features?
Which classifiers?
What to do with noise?
Prediction of chestnut origin from their properties is important!!
Industrial applications, for example: verification of certificates of product origin.
Papaya from Italy?!! Think before you eat!!
Why feature selection?
Botanic features are extracted, collected, and stored in a data set by human agents.
The process is:
- lengthy
- costly
- error prone
Decision tree
Its nodes perform a test on a data attribute: the outcome partitions the training set into smaller partitions until the class value becomes homogeneous. The class value is the prediction of the decision tree for the set of data records that reach that final node.
Tree induction by entropy
The C4.5 algorithm induces the form of the decision tree using the entropy of the class value. It grows the tree until the entropy reduction falls under a user-defined threshold.
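The entropy-reduction criterion above can be sketched in a few lines. This is a minimal illustration, not the C4.5 implementation itself; the function names and toy labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy reduction achieved by splitting `labels` into `partitions`.

    `partitions` is the list of label subsets produced by a test on one
    attribute. C4.5 keeps growing the tree only while this reduction
    stays above a user-defined threshold.
    """
    n = len(labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - weighted

# A pure split removes all uncertainty about the class:
labels = ["A", "A", "B", "B"]
print(information_gain(labels, [["A", "A"], ["B", "B"]]))  # 1.0
```

A split that leaves each partition class-homogeneous (entropy 0) yields the maximum reduction, which is why such nodes become leaves.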
Random forest
A random forest is a special ensemble learner. A large number of decision trees is grown; each tree depends on the values of a random vector. Predictions are usually combined using the technique of majority voting, so that the most popular class among the trees is predicted.
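The two ingredients named above, per-tree random resampling and majority voting, can be sketched as follows. This is a toy illustration under the stated assumptions (the `tree` models are stand-ins, not trained classifiers):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Training set for one tree: random selection with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the per-tree predictions: the most popular class wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))                     # 10 (same size, duplicates allowed)
print(majority_vote(["A", "B", "A"]))  # A
```

In the full algorithm, one decision tree would be trained on each bootstrap sample and `majority_vote` applied to their predictions for every test record.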
Which classifiers?
Initial data set: 1600 samples, 19 features from fruit peculiarities.

MLP    58.12%
RBF    47.97%
C4.5   49.81%
RF     55.06%
SMO    52.50%
NEW FEATURES?
New data set: 1600 samples, 37 features from fruit and plant peculiarities.
Feature selection methods: Symmetrical Uncertainty, Chi-Square Statistic, Gain Ratio, Information Gain, OneR methodology...
FINAL DATA SET
1600 samples, 6 features selected by the entropy-based information gain criterion:
- # of chestnuts/kg
- Diameter of the trunk
- # of female inflorescences/ament
- Ament length
- Length of the leaf limb
- Height of the plant
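The information-gain ranking used to pick those 6 features can be sketched as below. The toy feature names and values are illustrative assumptions; the real features are continuous and would be discretized first.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the class."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_features(columns, labels, k):
    """Rank features by information gain and keep the top k."""
    ranked = sorted(columns,
                    key=lambda name: info_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]

# Toy data: `trunk_diameter` separates the classes, `noise` does not.
labels = ["north", "north", "south", "south"]
columns = {"trunk_diameter": ["small", "small", "large", "large"],
           "noise": ["x", "y", "x", "y"]}
print(select_features(columns, labels, 1))  # ['trunk_diameter']
```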
Some details...
Random forest: a forest of trees built with all 6 features, each trained on a different training set built by random selection of samples with replacement.
Training set (70%): 1120 samples
Test set (30%): 480 samples
Target classes: 8
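The 70/30 experimental split and the accuracy measure can be sketched as follows; the helper names are assumptions, and the split here is a plain shuffle (the slides don't say whether it was stratified by class).

```python
import random

def train_test_split(samples, seed=0):
    """Shuffle and split 70% / 30% (1600 samples -> 1120 train / 480 test)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # 70% cut, integer arithmetic
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Fraction of (record, label) pairs the classifier predicts correctly."""
    hits = sum(classifier(x) == y for x, y in test_set)
    return hits / len(test_set)

train, test = train_test_split(list(range(1600)))
print(len(train), len(test))  # 1120 480
```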
MLP: 6, 12 and 8 neurons respectively in the input, hidden and output layers (one output neuron for each geographic zone). Training phase of 100 iterations.
Decision tree: used the default settings of the Weka tool; obtained a binary tree of 15 leaves and 6 levels.
ACCURACIES ON TEST SET
MLP 97.91%
C4.5 100%
RF 100%
ARE THESE CLASSIFIERS ROBUST? WHAT IS THEIR PERFORMANCE ON A NOISY DATA SET?

Noise is injected by perturbing each attribute value:

i'(A) = i(A) ± 0.05 · i(A)

where i(A) is the value of attribute A in the i-th instance.
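The perturbation i'(A) = i(A) ± 0.05 · i(A) can be sketched as below. The exact noise distribution isn't stated in the transcript; drawing the factor uniformly from [-0.05, +0.05] is an assumption, as is the function name.

```python
import random

def add_noise(value, level=0.05, rng=random):
    """Perturb an attribute value by up to +/-5% of itself:
    i'(A) = i(A) +/- 0.05 * i(A), sign and magnitude drawn at random
    (uniform distribution assumed)."""
    return value + rng.uniform(-level, level) * value

rng = random.Random(42)
noisy = add_noise(100.0, rng=rng)
print(95.0 <= noisy <= 105.0)  # True
```

Applying this to every attribute of every test record yields the noisy test set used in the table below.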
ACCURACIES ON TEST SET

       WITHOUT NOISE   WITH NOISE
MLP    97.91%          93.12%
C4.5   100%            82.29%
RF     100%            90.62%
[Plot of accuracy under noise: + decision tree, ∆ random forest, * multilayer perceptron. The MLP is QUITE STABLE!! Its class prediction is marginally affected by noise. The trees are MORE SENSITIVE!!]
Conclusions
The results, in the context of this peculiar domain, confirm the robustness of neural network classification techniques and their reliability in treating noisy data. Even though decision trees and random forests reach higher accuracy rates on clean test data, when noise is present they turn out to be less robust and stable.
FURTHER WORK COMING SOON!!!!