
1

Pattern Recognition and Machine Learning

Lucy Kuncheva

School of Computer Science
Bangor University

Part 2

2

Pattern Recognition – DIY using WEKA

3

The weka (also known as Maori hen or woodhen) (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit.

http://en.wikipedia.org/wiki/Weka



4

WEKA
http://www.cs.waikato.ac.nz/ml/weka/

“WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.”

5

WEKA

And we will be using only the hammer...

6

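Data set:

[Figure: the data set drawn as a table. The rows are the OBJECTS, numbered 1, 2, 3, ..., N; the columns are the FEATURES (attributes, variables, covariates...), numbered 1, 2, 3, ..., n. One cell is marked as the value of feature # 2 for object # 3.]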

PROBLEM

Your data sets are of the WIDE type: a small number of objects and a large number of features.

WEKA

7

Prepare the file .arff:

1. Open in an ASCII editor
2. Add rows
   • @RELATION one_word
   • @ATTRIBUTE name NUMERIC ... for all features
   • @ATTRIBUTE class {1,2} ... for the class variable
   • @DATA
3. Paste the data underneath
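
The same file can also be produced by a short script instead of by hand. The sketch below is only an illustration in Python: the function name to_arff, the relation name, the feature names and the numbers are all made up, and it assumes numeric features, no missing values, and one-word names without spaces.

```python
def to_arff(relation, feature_names, rows, class_labels=("1", "2")):
    """Return the text of an .arff file with numeric features and a nominal class.

    rows is a list of (feature_values, class_label) pairs.
    """
    lines = ["@RELATION " + relation, ""]
    for name in feature_names:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.append("@ATTRIBUTE class {" + ",".join(class_labels) + "}")
    lines.append("")
    lines.append("@DATA")
    for values, label in rows:
        # One object per line: comma-separated feature values, class label last
        lines.append(",".join(str(v) for v in values) + "," + str(label))
    return "\n".join(lines) + "\n"


# Made-up values, for illustration only
example = to_arff(
    "talent_test",
    ["HEIGHT_Standing_cm", "WEIGHT_Kg"],
    [([152.0, 41.5], "1"), ([147.5, 38.0], "2")],
)
print(example)
```

Saving the returned text under a name ending in .arff gives a file laid out as in the three steps above.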

8

Feature selection

(b) Feature subsets

2 questions

How do we select the subsets?

How do we evaluate the worth of a subset?

9

Feature selection

(b) Feature subsets

2 questions

How do we select the subsets?

How do we evaluate the worth of a subset?

Not our problem now

Wrapper: classification accuracy

Filter: some easier-to-calculate proxy for the classification accuracy

Embedded: decision tree classifier, SVM

10

Feature selection

(b) Feature subsets

2 questions

How do we select the subsets?

How do we evaluate the worth of a subset?

Wrapper, Filter, Embedded

Ranker

Greedy: Sequential Forward Selection (SFS)

Random

Heuristic search: Genetic Algorithms (GA), Swarm optimisation

Bespoke


12

Feature selection methods

FCBF (Fast Correlation-Based Filter) - originally proposed for microarray data analysis (Yu and Liu, 2003). The idea of FCBF is that the features that are worth keeping should be correlated with the class variable but not correlated among themselves.

CfsSubsetEval

1. L. Yu and H. Liu (2003), Feature selection for high-dimensional data: A fast correlation-based filter solution.

13


Relief-F. Kira and Rendell, 1992; Kononenko et al., 1997.

For each object in the data set, find the nearest neighbour from the same class (NearHit) and the nearest neighbour from the opposite class (NearMiss) using all features. The relevance score of a feature increases if the feature value in the current object is closer to that in the NearHit compared to that in the NearMiss. Otherwise, the relevance score of the feature decreases.
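
The update described above can be sketched in a few lines of Python. This is the basic Relief scheme with one NearHit and one NearMiss per object, not WEKA's ReliefFAttributeEval; the function name relief_scores, the scaling of each feature by its range, and the toy data are assumptions made for illustration. It needs at least two objects in each class.

```python
def relief_scores(X, y):
    """Basic Relief relevance scores (one NearHit and one NearMiss per object)."""
    n_objects, n_features = len(X), len(X[0])
    # Feature ranges, used to put all differences on a 0..1 scale
    ranges = []
    for j in range(n_features):
        column = [row[j] for row in X]
        ranges.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / ranges[j]

    def distance(a, b):
        # Distance uses all features, as in the description above
        return sum(diff(a, b, j) for j in range(n_features))

    scores = [0.0] * n_features
    for i in range(n_objects):
        others = [k for k in range(n_objects) if k != i]
        hits = [k for k in others if y[k] == y[i]]
        misses = [k for k in others if y[k] != y[i]]
        near_hit = min(hits, key=lambda k: distance(X[i], X[k]))
        near_miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(n_features):
            # Score goes up if the object is closer to NearHit than to NearMiss
            scores[j] += (diff(X[i], X[near_miss], j)
                          - diff(X[i], X[near_hit], j)) / n_objects
    return scores


# Made-up data: feature 0 separates the classes, feature 1 does not
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.8], [3.1, 1.2], [2.9, 3.1]]
y = ["G", "G", "G", "N", "N", "N"]
scores = relief_scores(X, y)
print(scores)
```

On this toy data the score of feature 0 comes out positive and the score of feature 1 negative, so a ranker would place feature 0 first.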

ReliefFAttributeEval

1. K. Kira and L. Rendell (1992). The Feature Selection Problem: Traditional Methods and a New Algorithm. AAAI-92 Proceedings.

2. I. Kononenko et al. (1997). Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence, 7(1), 39-55.

14


Relief-F.

[Figure: objects plotted in the (x, y) plane, with the current object, its NearHit and its NearMiss marked.]

Relevance score for x increases

Relevance score for y decreases

15


SVM. This classifier builds a linear function that separates the classes. The hyperplane is calculated so as to maximise the distance to the nearest points. The absolute values of the coefficients in front of the features can be interpreted as “importance”.

SVM-RFE. RFE stands for “Recursive Feature Elimination” (Guyon et al., 2006). Starting with an SVM on the entire feature set, a fraction of the features with the lowest weights is dropped. A new SVM is trained with the remaining features, and subsequently reduced in the same way. The procedure stops when the set of the desired cardinality is reached.

While SVM-RFE has been found to be extremely useful for wide data such as functional magnetic resonance imaging (fMRI) data (DeMartino et al., 2008), it was discovered that the RFE step is not always needed (Abeel et al., 2010; Geurts et al., 2005).

SVMAttributeEval
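
The elimination loop can be sketched as follows. This is a rough illustration in Python, not WEKA's SVMAttributeEval: the linear SVM here is a plain batch sub-gradient descent on the regularised hinge loss without a bias term, and the names train_linear_svm and svm_rfe, the learning settings and the toy data are all assumptions. One feature is dropped at each iteration.

```python
def train_linear_svm(X, y, features, epochs=200, rate=0.01, reg=0.01):
    """Batch sub-gradient descent on the regularised hinge loss (no bias term).

    y holds +1/-1 labels; returns one weight per entry of `features`.
    """
    w = [0.0] * len(features)
    for _ in range(epochs):
        grad = [reg * wj for wj in w]
        for row, label in zip(X, y):
            margin = label * sum(wj * row[f] for wj, f in zip(w, features))
            if margin < 1:  # object inside the margin or misclassified
                for k, f in enumerate(features):
                    grad[k] -= label * row[f] / len(X)
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y, n_keep):
    """Drop the feature with the smallest |weight| until n_keep remain."""
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > n_keep:
        w = train_linear_svm(X, y, features)  # retrain on the remaining features
        worst = min(range(len(features)), key=lambda k: abs(w[k]))
        eliminated.append(features.pop(worst))
    return features, eliminated


# Made-up data: feature 0 is strongly related to the class, feature 1 weakly,
# and feature 2 carries no class information
X = [[2.0, 0.5, 0.1], [3.0, 0.4, -0.1], [-2.0, -0.5, 0.1], [-3.0, -0.4, -0.1]]
y = [1, 1, -1, -1]
kept, dropped = svm_rfe(X, y, n_keep=1)
print(kept, dropped)
```

Ranking by a single SVM, without RFE, corresponds to calling train_linear_svm once on all features and sorting them by the absolute values of the weights.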

16


SVM-RFE: eliminate one feature at each iteration (default)

SVM: set this value (the number of features eliminated at each iteration) to 0

17


SVM and SVM-RFE

Ranked attributes:
 6  2 GRIP_TEST_Right
 5  5 HEIGHT_Standing_cm
 4  1 GRIP_TEST_Left
 3  4 HEIGHT_Seated_cm
 2  3 WEIGHT_Kg
 1  6 ARM_SPAN_cm

For this example, both SVM and SVM-RFE give the same result

FCBF

Selected attributes: 1,2,5 : 3
 GRIP_TEST_Left
 GRIP_TEST_Right
 HEIGHT_Standing_cm

Relief-F

Ranked attributes:
 0.07863  2 GRIP_TEST_Right
 0.07549  5 HEIGHT_Standing_cm
 0.05528  4 HEIGHT_Seated_cm
 0.05414  1 GRIP_TEST_Left
 0.03172  3 WEIGHT_Kg
 0.00797  6 ARM_SPAN_cm

18



PROBLEM

While these results are (probably) curious, there is no statistical significance we can attach to them...

19

Time for a coffee-break

20


Permutation test

Feature of interest: X

Class label variable: Y (say, G/N)

Let XG be the sample from class G, and XN the sample from class N.

A two-sample t-test can be used to test the hypothesis of equal means when XG and XN come from approximately normal distributions.

If we cannot ascertain this condition, use PERMUTATION tests.

Quantity of interest

V = | m_XG - m_XN | (absolute difference between the two means, where m_XG is the mean of XG and m_XN is the mean of XN)

Observed value for our data: V*

Question: What is the probability that we observe V* or a larger value if there were no relationship between X and the class label Y?

X     Y
4.3   G
2.1   N
1.8   G
2.3   G
3.2   N
...   ...

21


Permutation test

1. ANTHRO- HEIGHT - Standing (cm)

[Figure: histogram of V for permuted labels, with the observed value marked. Horizontal axis: abs(mean1 - mean2), from 0 to 10; vertical axis: # occurrences, from 0 to 30000.]

p-value = 0.0046

Very small chance to obtain the observed V* or larger.

22


Permutation test

p-value   feature
0.0046    1. ANTHRO- HEIGHT - Standing (cm)
0.0058    1. ANTHRO- HEIGHT - Seated (cm)
0.0077    1. ANTHRO - GRIP TEST Right
0.0100    2.1 DT PACE BOWL - Average MPH
0.0123    1. ANTHRO-WEIGHT (Kg)
0.0159    1. ANTHRO - GRIP TEST - Left
0.0193    1. ANTHRO - ARM SPAN (cm)
0.0266    2.1 DT PACE BOWL - max MPH
0.0319    8.1 FT - SPRINT (40m)
0.0489    8.1 FT - SPRINT (30m)

23

The Dead Salmon

Neuroscientist Craig Bennett purchased a whole Atlantic salmon, took it to a lab at Dartmouth, and put it into an fMRI machine used to study the brain. The beautiful fish was to be the lab’s test object as they worked out some new methods.

So, as the fish sat in the scanner, they showed it “a series of photographs depicting human individuals in social situations.” To maintain the rigor of the protocol (and perhaps because it was hilarious), the salmon, just like a human test subject, “was asked to determine what emotion the individual in the photo must have been experiencing.”

Lo and behold! Brain activity responding to the stimuli!

24

Bonferroni correction for multiple comparisons = the simplest and most conservative method to control the familywise error rate

If we increase the number of hypotheses in a test, we also increase the likelihood of witnessing a rare event, and therefore declaring difference when there is none.

So, if the desired significance level for the whole family of n tests should be (at most) α, then the Bonferroni correction would test each individual hypothesis at a significance level of α/n.

In our case, we have n = 50, significance level 0.05/50 = 0.001.
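
As a small worked check, the correction can be applied to the ten p-values listed on the permutation-test slide. The Python sketch below is only an illustration; the function name bonferroni_significant is made up, and n = 50 and alpha = 0.05 are taken from the text above.

```python
def bonferroni_significant(p_values, n_tests, alpha=0.05):
    """Return the per-test threshold alpha/n and the p-values below it."""
    threshold = alpha / n_tests
    return threshold, [p for p in p_values if p < threshold]


# The ten p-values listed on the permutation-test slide
p_values = [0.0046, 0.0058, 0.0077, 0.0100, 0.0123,
            0.0159, 0.0193, 0.0266, 0.0319, 0.0489]

threshold, survivors = bonferroni_significant(p_values, n_tests=50)
uncorrected = [p for p in p_values if p < 0.05]
print(threshold, len(uncorrected), len(survivors))
```

All ten values are below the uncorrected level of 0.05, but none is below the corrected level of 0.05/50.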

25


Permutation test

PROBLEM

None of the features survives the Bonferroni correction: each p-value would have to be below 0.001 for an overall significance level of 0.05.

26


Permutation test

More PROBLEMs

1. If there are permutation tests in WEKA, they are hidden very well...

2. If there is Bonferroni correction in WEKA, it is hidden very well too...

Solution?

DIY...

27


Permutation test

Here is an algorithm for those of you with some programming experience (the null hypothesis is “no difference”, hence V = 0; assume that the greater the V, the larger the difference):

1. Calculate the observed value V*. Choose the number of iterations, e.g., T = 10,000.

2. for i = 1:T

a) Permute the labels randomly

b) Calculate and store V(i) with the permuted labels

3. end (for)

4. Calculate the p-value as the proportion of the stored values V(i) that are greater than or equal to V*.

5. If you do this experiment for n features, compare p with alpha/n, where alpha is your chosen significance level (typically alpha = 0.05).
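
For readers without MATLAB, the same steps can be sketched in Python for a single feature. The function name permutation_p_value, the fixed random seed and the ten made-up feature values are assumptions for illustration only; the sketch handles exactly two classes and no missing values.

```python
import random


def permutation_p_value(x, labels, n_iterations=10000, seed=0):
    """p-value for V = |mean of one class - mean of the other| under permuted labels."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "the sketch handles exactly two classes"

    def v(current_labels):
        a = [xi for xi, li in zip(x, current_labels) if li == classes[0]]
        b = [xi for xi, li in zip(x, current_labels) if li == classes[1]]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    v_star = v(labels)                   # step 1: observed value V*
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    permuted = list(labels)
    count = 0
    for _ in range(n_iterations):        # steps 2-3
        rng.shuffle(permuted)            # a) permute the labels randomly
        if v(permuted) >= v_star:        # b) compare V(i) with V*
            count += 1
    return v_star, count / n_iterations  # step 4: proportion of V(i) >= V*


# Made-up feature values for ten objects, five in each class
x = [4.3, 4.1, 3.9, 4.4, 4.0, 2.1, 1.8, 2.3, 2.2, 1.9]
labels = ["G"] * 5 + ["N"] * 5
v_star, p = permutation_p_value(x, labels)
print(v_star, p)
```

When several features are tested, the returned p is compared with alpha/n as in step 5, not with alpha.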

28


Permutation test

And here is a MATLAB script

% Permutation test (assume that there are no missing values)
clear, close, clc

X = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','G2:L27');
[~,Y] = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','F2:F27'); % symbolic label
[~,Names] = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','G1:L1'); % feature names

% Convert Y to numbers (1 selected, 2 not selected)
u = unique(Y); L = ones(size(Y)); L(strcmp(u(1),Y)) = 2;

T = 20000;

continues on the next slide

29


Permutation test

And here is a MATLAB script continued from previous slide...

V = zeros(T,size(X,2)); % one row per permutation, one column per feature
for i = 1:T
    la = L(randperm(numel(L))); % permute the labels
    for j = 1:size(X,2) % for each feature
        fe = X(:,j);
        V(i,j) = abs(mean(fe(la == 1)) - mean(fe(la == 2)));
    end
end

% p-values for the features
for j = 1:size(X,2)
    V_star(j) = abs(mean(X(L == 1,j)) - mean(X(L == 2,j))); % observed value
    p(j) = mean(V(:,j) >= V_star(j)); % proportion of V greater than or equal to V*
    fprintf('%35s %.4f\n',Names{j},p(j))
end

30


Permutation test

MATLAB output

1. ANTHRO - GRIP TEST - Left 0.0176

1. ANTHRO - GRIP TEST Right 0.0087

1. ANTHRO-WEIGHT (Kg) 0.0124

1. ANTHRO- HEIGHT - Seated (cm) 0.0059

1. ANTHRO- HEIGHT - Standing (cm) 0.0055

1. ANTHRO - ARM SPAN (cm) 0.0202

The numbers may vary slightly from one run to the next because the permutations are random. The larger the number of iterations (T), the smaller this variation.

The p-values are not corrected (Bonferroni). Correction should be applied if necessary.

31

Time for a coffee-break

32

Time for our classifiers!!!

The Classification tab

Choose a classifier (SVM)

Choose a training-testing protocol

When ready (all chosen) click here

33

Where to find the results

The confusion matrix

34

Where to find the results


(and classification error)

35

And a lot lot more ...

36

Thank you!
