1
Pattern Recognition and Machine Learning
Lucy Kuncheva
School of Computer Science, Bangor University
Part 2
2
Pattern Recognition – DIY using WEKA
3
The weka (also known as Maori hen or woodhen) (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit.
http://en.wikipedia.org/wiki/Weka
4
WEKA
http://www.cs.waikato.ac.nz/ml/weka/
“WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.”
5
WEKA
And we will be using only the hammer...
6
Data set:
[Figure: data matrix with OBJECTS as rows (1, 2, 3, ..., N) and FEATURES (attributes, variables, covariates...) as columns (1, 2, 3, ..., n); object #3 and feature #2 are marked as examples.]
Your data sets are of the WIDE type: small number of objects, large number of features
PROBLEM
WEKA
7
Prepare the .arff file:
1. Open the data in an ASCII editor
2. Add rows
• @RELATION one_word
• @ATTRIBUTE name NUMERIC ... for all features
• @ATTRIBUTE class {1,2} ... for the class variable
• @DATA
3. Paste the data underneath
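The same file can also be written from MATLAB instead of by hand. This is only a sketch: the data matrix, the attribute names and the file name my_data.arff are made-up values for illustration, and the class variable is assumed to take the values 1 and 2 as in the header shown above.

```matlab
% Sketch: write a numeric data matrix and class labels to an ARFF file.
% X, labels, names and the file name are hypothetical illustrative values.
X = [4.3 150; 2.1 162; 1.8 158];     % 3 objects, 2 features
labels = [1; 2; 1];                  % class variable with values 1 or 2
names = {'feature_a', 'feature_b'};  % one word per attribute name

fid = fopen('my_data.arff', 'w');
fprintf(fid, '@RELATION my_data\n');
for j = 1:numel(names)
    fprintf(fid, '@ATTRIBUTE %s NUMERIC\n', names{j});
end
fprintf(fid, '@ATTRIBUTE class {1,2}\n');
fprintf(fid, '@DATA\n');
for i = 1:size(X, 1)
    fprintf(fid, '%g,', X(i, :));    % feature values, comma-separated
    fprintf(fid, '%d\n', labels(i)); % class label closes the row
end
fclose(fid);
```

Each data row ends with the class label, in the same order as the @ATTRIBUTE lines.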
8
Feature selection
(b) Feature subsets
2 questions
How do we select the subsets?
How do we evaluate the worth of a subset?
9
Feature selection
(b) Feature subsets
2 questions
How do we select the subsets?
Not our problem now
How do we evaluate the worth of a subset?
• Wrapper: classification accuracy
• Filter: some easier-to-calculate proxy for the classification accuracy
• Embedded: decision tree classifier, SVM
10
Feature selection
(b) Feature subsets
2 questions
How do we select the subsets?
• Ranker
• Greedy: Sequential Forward Selection (SFS)
• Random
• Heuristic search: Genetic Algorithms (GA), Swarm optimisation
• Bespoke
How do we evaluate the worth of a subset?
• Wrapper
• Filter
• Embedded
12
Feature selection methods
FCBF (Fast Correlation-Based Filter) - originally proposed for microarray data analysis (Yu and Liu, 2003). The idea of FCBF is that the features that are worth keeping should be correlated with the class variable but not correlated among themselves.
CfsSubsetEval
1. L. Yu and H. Liu (2003), Feature selection for high-dimensional data: A fast correlation-based filter solution.
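The "correlated with the class but not among themselves" idea can be illustrated with a few lines of MATLAB. This is a simplified sketch and not the FCBF algorithm itself: plain Pearson correlation stands in for FCBF's symmetrical uncertainty measure, the data are synthetic, and the redundancy threshold of 0.9 is an arbitrary assumption.

```matlab
% Simplified sketch of the "relevant but not redundant" idea.
% NOT the FCBF algorithm: Pearson correlation is used as a stand-in
% measure, and the threshold below is an arbitrary assumption.
rng(1);                                  % fixed seed for repeatability
N = 40;
y = [ones(N/2, 1); 2 * ones(N/2, 1)];    % hypothetical class labels
f1 = y + 0.3 * randn(N, 1);              % relevant feature
f2 = f1 + 0.05 * randn(N, 1);            % near-copy of f1 (redundant)
f3 = randn(N, 1);                        % irrelevant feature
X = [f1 f2 f3];

n = size(X, 2);
relevance = zeros(1, n);
for j = 1:n
    c = corrcoef(X(:, j), y);
    relevance(j) = abs(c(1, 2));         % |correlation with the class|
end

[~, order] = sort(relevance, 'descend'); % most relevant feature first
threshold = 0.9;                         % assumed redundancy cut-off
selected = [];
for j = order
    redundant = false;
    for k = selected
        c = corrcoef(X(:, j), X(:, k));
        if abs(c(1, 2)) > threshold      % too similar to a kept feature
            redundant = true;
        end
    end
    if ~redundant
        selected(end + 1) = j;
    end
end
disp(selected)
```

Only one of the two near-identical features is kept. Note that this redundancy filter alone does not remove the irrelevant feature; a relevance cut-off would be needed for that.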
13
Feature selection methods
Relief-F. Kira and Rendell, 1992; Kononenko et al., 1997.
For each object in the data set, find the nearest neighbour from the same class (NearHit) and the nearest neighbour from the opposite class (NearMiss) using all features. The relevance score of a feature increases if the feature value in the current object is closer to that in the NearHit compared to that in the NearMiss. Otherwise, the relevance score of the feature decreases.
ReliefFAttributeEval
1. K. Kira and L. Rendell (1992). The Feature Selection Problem: Traditional Methods and a New Algorithm. AAAI-92 Proceedings.
2. I. Kononenko et al. Overcoming the myopia of inductive learning algorithms with RELIEFF (1997), Applied Intelligence, 7(1), p39-55
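The NearHit / NearMiss update can be sketched in MATLAB as follows. This is the basic two-class Relief update under some assumptions: the toy data are invented, distances are city-block over all features, and feature differences are scaled by the feature range. WEKA's ReliefFAttributeEval has more options (for example several neighbours), so this is not a re-implementation of it.

```matlab
% Sketch of the basic two-class Relief update (toy data, assumed scaling).
% Feature 1 separates the classes; feature 2 does not.
X = [0.10 0.90; 0.20 0.10; 0.15 0.50; 0.90 0.80; 0.80 0.20; 0.85 0.45];
y = [1; 1; 1; 2; 2; 2];

[N, n] = size(X);
span = max(X) - min(X);                  % range of each feature
w = zeros(1, n);                         % relevance scores
for i = 1:N
    d = sum(abs(X - repmat(X(i, :), N, 1)), 2);  % distances using all features
    d(i) = Inf;                          % exclude the object itself
    dHit = d;   dHit(y ~= y(i)) = Inf;   % keep same-class candidates only
    dMiss = d;  dMiss(y == y(i)) = Inf;  % keep other-class candidates only
    [~, hit] = min(dHit);                % NearHit
    [~, miss] = min(dMiss);              % NearMiss
    % Score goes up when the NearMiss is further away than the NearHit
    w = w - abs(X(i, :) - X(hit, :)) ./ span / N ...
          + abs(X(i, :) - X(miss, :)) ./ span / N;
end
disp(w)
```

For this toy set the score of feature 1 comes out positive and the score of feature 2 negative.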
14
Feature selection methods
Relief-F.
[Figure: the current object, its NearHit and its NearMiss plotted against two features, x and y.]
Relevance score for x increases
Relevance score for y decreases
15
Feature selection methods
SVM. This classifier builds a linear function that separates the classes. The hyperplane is calculated so as to maximise the distance to the nearest points. The absolute values of the coefficients in front of the features can be interpreted as “importance”.
SVM-RFE. RFE stands for “Recursive Feature Elimination” (Guyon et al., 2006). Starting with an SVM on the entire feature set, a fraction of the features with the lowest weights is dropped. A new SVM is trained with the remaining features, and subsequently reduced in the same way. The procedure stops when the set of the desired cardinality is reached. While SVM-RFE has been found to be extremely useful for wide data such as functional magnetic resonance imaging (fMRI) data (DeMartino et al., 2008), it was discovered that the RFE step is not always needed (Abeel et al., 2010; Geurts et al., 2005).
SVMAttributeEval
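The elimination loop itself is short. The sketch below shows only that loop, under a loudly stated assumption: a least-squares linear fit stands in for the SVM so that no toolbox is needed, and its weights play the role of the SVM coefficients. The data, the seed and the target of two features are made up for illustration.

```matlab
% Sketch of the recursive elimination loop only.
% ASSUMPTION: a least-squares linear fit stands in for the SVM; its
% weights are used in place of the SVM coefficients.
rng(2);                                  % fixed seed for repeatability
N = 60;
t = [-ones(N/2, 1); ones(N/2, 1)];       % class targets coded as -1 / +1
X = randn(N, 5);                         % five hypothetical features
X(:, 2) = X(:, 2) + 2 * t;               % feature 2 carries class information
X(:, 4) = X(:, 4) + 1 * t;               % feature 4 carries somewhat less

% Standardise so that weight magnitudes are comparable across features
Z = (X - repmat(mean(X), N, 1)) ./ repmat(std(X), N, 1);

remaining = 1:size(X, 2);                % indices of features still in play
desired = 2;                             % desired cardinality
while numel(remaining) > desired
    wts = [Z(:, remaining) ones(N, 1)] \ t;   % linear fit with a bias term
    [~, worst] = min(abs(wts(1:end - 1)));    % smallest |weight|, bias ignored
    remaining(worst) = [];                    % drop one feature per iteration
end
disp(remaining)
```

Ranking by the weights of the first fit alone, without refitting, corresponds to the plain SVM ranking mentioned above.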
16
Feature selection methods
SVM-RFE: Eliminate one feature at each iteration (default)
SVM: Set this value to 0
17
Feature selection methods
SVM / SVM-RFE (for this example, both SVM and SVM-RFE give the same result)
Ranked attributes:
 6  2 GRIP_TEST_Right
 5  5 HEIGHT_Standing_cm
 4  1 GRIP_TEST_Left
 3  4 HEIGHT_Seated_cm
 2  3 WEIGHT_Kg
 1  6 ARM_SPAN_cm
FCBF
Selected attributes: 1,2,5 : 3
 GRIP_TEST_Left
 GRIP_TEST_Right
 HEIGHT_Standing_cm
Relief-F
Ranked attributes:
 0.07863  2 GRIP_TEST_Right
 0.07549  5 HEIGHT_Standing_cm
 0.05528  4 HEIGHT_Seated_cm
 0.05414  1 GRIP_TEST_Left
 0.03172  3 WEIGHT_Kg
 0.00797  6 ARM_SPAN_cm
18
Feature selection methods
PROBLEM
While these results are (probably) curious, there is no statistical significance we can attach to them...
19
Time for a coffee-break
20
Feature selection methods
Permutation test
Feature of interest: X
Class label variable: Y (say, G/N)
Let XG be the sample from class G, and XN, the sample from class N.
Two-sample t-test can be used to test the hypothesis of equal means when XG and XN come from approximately normal distributions.
If we cannot ascertain this condition, use PERMUTATION tests.
Quantity of interest
V = | m_XG - m_XN | (absolute difference between the two class means)
Observed value for our data: V*
Question: What is the probability of observing a value of V* or larger if there were no relationship between X and the class label Y?
X     Y
4.3   G
2.1   N
1.8   G
2.3   G
3.2   N
...   ...
21
Feature selection methods
Permutation test
[Figure: histogram of V for permuted labels, feature "1. ANTHRO- HEIGHT - Standing (cm)". Horizontal axis: abs(mean1 - mean2); vertical axis: # occurrences. The observed value is marked on the plot.]
p-value = 0.0046
Very small chance to obtain the observed V* or larger.
22
Feature selection methods
Permutation test
p-value   feature
0.0046    1. ANTHRO- HEIGHT - Standing (cm)
0.0058    1. ANTHRO- HEIGHT - Seated (cm)
0.0077    1. ANTHRO - GRIP TEST Right
0.0100    2.1 DT PACE BOWL - Average MPH
0.0123    1. ANTHRO-WEIGHT (Kg)
0.0159    1. ANTHRO - GRIP TEST - Left
0.0193    1. ANTHRO - ARM SPAN (cm)
0.0266    2.1 DT PACE BOWL - max MPH
0.0319    8.1 FT - SPRINT (40m)
0.0489    8.1 FT - SPRINT (30m)
23
Neuroscientist Craig Bennett purchased a whole Atlantic salmon, took it to a lab at Dartmouth, and put it into an fMRI machine used to study the brain. The beautiful fish was to be the lab’s test object as they worked out some new methods.
So, as the fish sat in the scanner, they showed it “a series of photographs depicting human individuals in social situations.” To maintain the rigor of the protocol (and perhaps because it was hilarious), the salmon, just like a human test subject, “was asked to determine what emotion the individual in the photo must have been experiencing.”
The Dead Salmon
Lo and behold! Brain activity responding to the stimuli!
24
Bonferroni correction for multiple comparisons = the simplest and most conservative method to control the familywise error rate
If we increase the number of hypotheses in a test, we also increase the likelihood of witnessing a rare event, and therefore declaring difference when there is none.
So, if the desired significance level for the whole family of n tests should be (at most) α, then the Bonferroni correction would test each individual hypothesis at a significance level of α/n.
In our case, we have n = 50, significance level 0.05/50 = 0.001.
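As a small worked example, the correction can be checked in MATLAB against the ten smallest p-values reported for the permutation test in this deck. The values of alpha and n are the ones quoted above; the variable names are illustrative.

```matlab
% Sketch: Bonferroni check for the ten smallest permutation-test p-values.
p = [0.0046 0.0058 0.0077 0.0100 0.0123 0.0159 0.0193 0.0266 0.0319 0.0489];
alpha = 0.05;                 % significance level for the whole family
n = 50;                       % number of tests in the family
threshold = alpha / n;        % per-test significance level
significant = p < threshold;  % one logical flag per feature
fprintf('Per-test threshold: %.4f\n', threshold);
fprintf('Features surviving the correction: %d\n', sum(significant));
```

Even the smallest p-value, 0.0046, is above the per-test threshold of 0.001.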
25
Feature selection methods
PROBLEM
None of the features survives the Bonferroni correction (p < 0.001 for significance level 0.05).
26
Feature selection methods
Permutation test
More PROBLEMs
1. If there are permutation tests in WEKA, they are hidden very well...
2. If there is Bonferroni correction in WEKA, it is hidden very well too...
Solution?
DIY...
27
Feature selection methods
Permutation test
Here is an algorithm for those of you with some programming experience (the null hypothesis is “no difference”, hence V = 0; assume the greater the V, the larger the difference):
1. Calculate the observed value V*. Choose the number of iterations, e.g., T = 10,000.
2. for i = 1:T
a) Permute the labels randomly
b) Calculate and store V(i) with the permuted labels
3. end (for)
4. Calculate the p-value as the proportion of V greater than or equal to V*.
5. If you do this experiment for n features, compare p with alpha/n, where alpha is your chosen significance level (typically alpha = 0.05).
28
Feature selection methods
Permutation test
And here is a MATLAB script
% Permutation test (assume that there are no missing values)
clear, close, clc

X = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','G2:L27');
[~,Y] = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','F2:F27'); % symbolic label
[~,Names] = xlsread('ECB U13 2010 talent testing data.xlsx',...
    'U13 Talent Test Raw Data','G1:L1'); % feature names

% Convert Y to numbers (1 selected, 2 not selected)
u = unique(Y);
L = ones(size(Y));
L(strcmp(u(1),Y)) = 2;

T = 20000; % number of permutations
continues on the next slide
29
Feature selection methods
Permutation test
And here is a MATLAB script continued from previous slide...
V = zeros(T,size(X,2)); % preallocate: one row per permutation
for i = 1:T
    la = L(randperm(numel(L))); % permute the labels randomly
    for j = 1:size(X,2) % for each feature
        fe = X(:,j);
        V(i,j) = abs(mean(fe(la == 1)) - mean(fe(la == 2)));
    end
end

% p-values for the features
for j = 1:size(X,2)
    V_star(j) = abs(mean(X(L == 1,j)) - mean(X(L == 2,j))); % observed value
    p(j) = mean(V(:,j) >= V_star(j)); % proportion of V >= V*
    fprintf('%35s %.4f\n',Names{j},p(j))
end
30
Feature selection methods
Permutation test
MATLAB output
1. ANTHRO - GRIP TEST - Left 0.0176
1. ANTHRO - GRIP TEST Right 0.0087
1. ANTHRO-WEIGHT (Kg) 0.0124
1. ANTHRO- HEIGHT - Seated (cm) 0.0059
1. ANTHRO- HEIGHT - Standing (cm) 0.0055
1. ANTHRO - ARM SPAN (cm) 0.0202
The numbers may vary slightly from one run to the next because of the random number generator. The larger the number of iterations (T), the smaller this variation.
The p-values are not corrected (Bonferroni). Correction should be applied if necessary.
31
Time for a coffee-break
32
Time for our classifiers!!!
The Classification tab
Choose a classifier (SVM)
Choose a training-testing protocol
When ready (all chosen) click here
33
Where to find the results
The confusion matrix
34
Where to find the results
Classification accuracy (and classification error)
35
And a lot lot more ...
36
Thank you!