V-detector: a real-valued negative selection algorithm
Zhou Ji, St. Jude Children’s Research Hospital
What is negative selection?
Biological background: T cells, thymus
Major steps:
1. Generate candidates randomly
2. Eliminate those that recognize self samples
Main steps: generation and detection
What is matching rule?
A matching rule defines when a sample and a detector are considered to match.
The matching rule plays an important role in a negative selection algorithm. It largely depends on the data representation.
In a real-valued representation, a detector can be visualized as a hypersphere.
Candidate 1: thrown away; candidate 2: made a detector.
Match or not match?
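A minimal sketch of the matching rule for the real-valued case, assuming Euclidean distance and hypersphere detectors as described above. The function names `matches` and `candidate_is_rejected` are illustrative and do not come from the slides.

```python
import math


def matches(sample, center, radius):
    """A sample matches a hypersphere if it lies within `radius` of `center`."""
    return math.dist(sample, center) <= radius


def candidate_is_rejected(candidate, self_samples, self_radius):
    """A candidate is thrown away if it matches any self sample."""
    return any(matches(candidate, s, self_radius) for s in self_samples)


self_samples = [(0.5, 0.5)]
print(candidate_is_rejected((0.52, 0.5), self_samples, 0.05))  # True: thrown away
print(candidate_is_rejected((0.9, 0.1), self_samples, 0.05))   # False: made a detector
```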
Main idea of V-detector
By allowing the detectors to have some variable properties, V-detector enhances the negative selection algorithm in several respects:
It takes fewer large detectors to cover the non-self region, saving time and space.
Small detectors cover “holes” better.
Coverage is estimated while the detector set is generated.
The shapes of detectors, or even the types of matching rules, can be extended to be variable too.
Main concept of Negative Selection and V-detector
Figure: constant-sized detectors vs. variable-sized detectors
Outline of the algorithm (generation of variable-sized detector set)
Detector Set Generation Algorithm

Constant-sized detectors

Detector-Set(S, m, r_s)
  S: set of self samples
  m: number of detectors
  r_s: self radius
  1: D ← ∅
  2: Repeat
  3:   x ← random sample from [0, 1]^n
  4:   Repeat for every s_i in S = {s_i, i = 1, 2, ...}
  5:     d ← Euclidean distance between s_i and x
  6:     if d ≤ r_s, go to 2
  7:   D ← D ∪ {x}
  8: Until |D| = m
  9: return D

Variable-sized detectors

V-Detector-Set(S, T_max, r_s, c_0)
  S: set of self samples
  T_max: maximum number of detectors
  r_s: self radius
  c_0: estimated coverage
  1: D ← ∅
  2: Repeat
  3:   t ← 0
  4:   T ← 0
  5:   r ← infinite
  6:   x ← random sample from [0, 1]^n
  7:   Repeat for every d_i in D = {d_i, i = 1, 2, ...}
  8:     d_d ← Euclidean distance between x(d_i) and x, where x(d_i) is the location of d_i
  9:     if d_d ≤ r(d_i) then, where r(d_i) is the radius of detector d_i
  10:      t ← t + 1
  11:      if t ≥ 1/(1 − c_0) then return D
  12:      go to 4
  13:  Repeat for every s_i in S
  14:    d ← Euclidean distance between s_i and x
  15:    if d − r_s ≤ r then r ← d − r_s
  16:  if r > 0 then D ← D ∪ {<x, r>}, where <x, r> is a detector with location x and radius r
  17:  else T ← T + 1
  18:  if T > 1/(1 − maximum self coverage) then exit
  19: Until |D| = T_max
  20: return D
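The two generation procedures above can be sketched in Python as follows. This is an interpretation under stated assumptions, not the author's implementation: the counter T is accumulated over the whole run instead of being reset exactly as in the listing, `max_self_coverage` is a hypothetical parameter standing in for "maximum self coverage", and all function and variable names are illustrative.

```python
import math
import random


def constant_sized_detectors(self_samples, m, r_s, dim, rng):
    """Keep random points that are farther than r_s from every self sample."""
    detectors = []
    while len(detectors) < m:
        x = tuple(rng.random() for _ in range(dim))
        if all(math.dist(x, s) > r_s for s in self_samples):
            detectors.append(x)
    return detectors


def v_detectors(self_samples, t_max, r_s, c0, dim, rng, max_self_coverage=0.99):
    """Return (center, radius) pairs; stop at t_max or at estimated coverage c0."""
    detectors = []
    covered_in_a_row = 0  # t in the listing
    self_hits = 0         # T in the listing (accumulated here: an assumption)
    while len(detectors) < t_max:
        x = tuple(rng.random() for _ in range(dim))
        if any(math.dist(x, c) <= r for c, r in detectors):
            covered_in_a_row += 1
            if covered_in_a_row >= 1 / (1 - c0):
                break  # enough consecutive covered samples
            continue
        covered_in_a_row = 0
        # Largest radius that stays clear of every self hypersphere.
        r = min(math.dist(x, s) for s in self_samples) - r_s
        if r > 0:
            detectors.append((x, r))
        else:
            self_hits += 1
            if self_hits > 1 / (1 - max_self_coverage):
                break  # the self region appears to fill the space
    return detectors


rng = random.Random(0)
self_samples = [(0.5, 0.5), (0.55, 0.5), (0.5, 0.45)]
fixed = constant_sized_detectors(self_samples, m=20, r_s=0.1, dim=2, rng=rng)
variable = v_detectors(self_samples, t_max=200, r_s=0.1, c0=0.9, dim=2, rng=rng)
print(len(fixed), len(variable))
```

Note that `constant_sized_detectors` loops forever if the self region covers the whole space. The variable-sized version avoids this through the `self_hits` check.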
Screenshots of the software
Figure: message view; visualization of data points and detectors
Experiments and Results
Synthetic Data: 2D. Training data are randomly chosen from the normal region.
Fisher’s Iris Data: One of the three types is considered as “normal”.
Biomedical Data: Abnormal data are the medical measures of disease carrier patients.
Air Pollution Data: Abnormal data are made by artificially altering the normal air measurements.
Ball bearings: Measurement: time series data with preprocessing - 30D and 5D
Synthetic data - Cross-shaped self space
Shape of self region and example detector coverage
(a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1
Synthetic data - Cross-shaped self space: Results
Figure: Detection rate and false alarm rate vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Figure: Number of detectors vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Error rates
Figure: Error rate (percentage) vs. self radius (0.01 to 0.19), false negative and false positive at 99% coverage
Synthetic data - Ring-shaped self space
Shape of self region and example detector coverage
(a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1
Synthetic data - Ring-shaped self space: Results
Figure: Detection rate and false alarm rate vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Figure: Number of detectors vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Iris data
Comparison with other methods: performance

Training data     Algorithm            Detection rate (%)   False alarm rate (%)
Setosa 100%       MILA                 95.16                0
                  NSA (single level)   100                  0
                  V-detector           99.98                0
Setosa 50%        MILA                 94.02                8.42
                  NSA (single level)   100                  11.18
                  V-detector           99.97                1.32
Versicolor 100%   MILA                 84.37                0
                  NSA (single level)   95.67                0
                  V-detector           85.95                0
Versicolor 50%    MILA                 84.46                19.6
                  NSA (single level)   96                   22.2
                  V-detector           88.3                 8.42
Virginica 100%    MILA                 75.75                0
                  NSA (single level)   92.51                0
                  V-detector           81.87                0
Virginica 50%     MILA                 88.96                24.98
                  NSA (single level)   97.18                33.26
                  V-detector           93.58                13.18
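The slides do not spell out how the two rates are computed. Assuming the usual definitions (detection rate = detected anomalies / all anomalies, false alarm rate = normal points flagged / all normal points), a small sketch looks like this. The function name and the toy labels are illustrative only.

```python
def detection_and_false_alarm(predicted_abnormal, actually_abnormal):
    """Return (detection rate, false alarm rate) in percent."""
    pairs = list(zip(predicted_abnormal, actually_abnormal))
    tp = sum(p and a for p, a in pairs)              # anomalies detected
    fn = sum((not p) and a for p, a in pairs)        # anomalies missed
    fp = sum(p and (not a) for p, a in pairs)        # normal points flagged
    tn = sum((not p) and (not a) for p, a in pairs)  # normal points passed
    detection_rate = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = 100.0 * fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_alarm_rate


# Toy labels, not data from the experiments.
predicted = [True, True, False, False, True]
actual = [True, True, True, False, False]
dr, far = detection_and_false_alarm(predicted, actual)
print(round(dr, 2), round(far, 2))  # 66.67 50.0
```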
Iris data
Comparison with other methods: number of detectors

Training data     Mean     Max   Min   SD
Setosa 100%       20       42    5     7.87
Setosa 50%        16.44    33    5     5.63
Versicolor 100%   153.24   255   72    38.8
Versicolor 50%    110.08   184   60    22.61
Virginica 100%    218.36   443   78    66.11
Virginica 50%     108.12   203   46    30.74
Iris data
Virginica as normal, 50% of points used to train
Figure: Detection rate and false alarm rate vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Figure: Number of detectors vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Biomedical data
Blood measures for a group of 209 patients.
Each patient has four different types of measurement.
75 patients are carriers of a rare genetic disorder. The others are normal.
Biomedical data: results comparison

Training data   Algorithm   Detection rate (%)   False alarm rate (%)   Number of detectors
                            Mean     SD          Mean     SD            Mean     SD
100% training   MILA        59.07    3.85        0        0             1000*    0
                NSA         69.36    2.67        0        0             1000     0
                r=0.1       30.61    3.04        0        0             21.52    7.29
                r=0.05      40.51    3.92        0        0             14.84    5.14
50% training    MILA        61.61    3.82        2.43     0.43          1000*    0
                NSA         72.29    2.63        2.94     0.21          1000     0
                r=0.1       32.92    2.35        0.61     0.31          15.51    4.85
                r=0.05      42.89    3.83        1.07     0.49          12.28    4
25% training    MILA        80.47    2.80        14.93    2.08          1000*    0
                NSA         86.96    2.72        19.50    2.05          1000     0
                r=0.1       43.68    4.25        1.24     0.5           12.24    3.97
                r=0.05      57.97    5.86        2.63     0.77          8.94     2.57
Biomedical data
Figure: Detection rate and false alarm rate vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Figure: Number of detectors vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Air pollution data
60 original records in total. Each consists of 16 different measurements concerning air pollution.
All the real data are considered normal.
More data are made artificially:
1. Decide the normal range of each of the 16 measurements
2. Randomly choose a real record
3. Change three randomly chosen measurements within a larger than normal range
4. If some of the changed measurements are out of range, the record is considered abnormal; otherwise it is considered normal
A total of 1000 records, including the original 60, are used as test data. The original 60 are used as training data.
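The four steps above could be sketched as follows. Two details are assumptions, since the slides do not state them: the normal range of a measurement is taken as the minimum and maximum over the real records, and the "larger than normal" range widens it by a hypothetical factor `widen` on each side. The stand-in records are random numbers, not the actual air pollution data.

```python
import random


def make_record(real_records, rng, widen=0.5, n_changed=3):
    """Return (record, is_abnormal) built from one randomly chosen real record."""
    n = len(real_records[0])
    # Step 1: normal range of each measurement (assumed: min/max of real data).
    lows = [min(r[j] for r in real_records) for j in range(n)]
    highs = [max(r[j] for r in real_records) for j in range(n)]
    # Step 2: randomly choose a real record.
    record = list(rng.choice(real_records))
    abnormal = False
    # Step 3: change three randomly chosen measurements within a wider range.
    for j in rng.sample(range(n), n_changed):
        span = highs[j] - lows[j]
        record[j] = rng.uniform(lows[j] - widen * span, highs[j] + widen * span)
        # Step 4: any changed value out of the normal range makes it abnormal.
        if not lows[j] <= record[j] <= highs[j]:
            abnormal = True
    return record, abnormal


rng = random.Random(1)
real = [[rng.random() for _ in range(16)] for _ in range(60)]  # stand-in data
generated = [make_record(real, rng) for _ in range(940)]
print(sum(is_abnormal for _, is_abnormal in generated), "abnormal of", len(generated))
```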
Air pollution data
Figure: Detection rate and false alarm rate vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Figure: Number of detectors vs. self radius (0.01 to 0.19), for 99% and 99.99% coverage
Ball bearing data
Raw data: time series of acceleration measurements
Preprocessing (from time domain to representation space for detection)
1. FFT (Fast Fourier Transform) with Hanning windowing: window size 30
2. Statistical moments: up to 5th order
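A rough sketch of the two preprocessing options, using only the standard library. Several choices are assumptions not stated in the slides: windows do not overlap, the Fourier features are coefficient magnitudes (giving 30 values per window), and the moments are the mean followed by central moments of order 2 to 5 (giving 5 values). A plain DFT stands in for the FFT here; it produces the same coefficients, only more slowly.

```python
import cmath
import math


def windows(series, size=30):
    """Split a series into consecutive non-overlapping windows (assumption)."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def hann(n):
    """Symmetric Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def fft_features(window):
    """Magnitudes of the DFT of the Hanning-weighted window."""
    n = len(window)
    x = [v * w for v, w in zip(window, hann(n))]
    return [
        abs(sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n)))
        for f in range(n)
    ]


def moment_features(window, max_order=5):
    """Mean, then central moments of order 2..max_order (assumption)."""
    n = len(window)
    mean = sum(window) / n
    return [mean] + [sum((v - mean) ** p for v in window) / n
                     for p in range(2, max_order + 1)]


series = [math.sin(0.3 * i) for i in range(95)]  # stand-in for acceleration data
points_30d = [fft_features(w) for w in windows(series)]
points_5d = [moment_features(w) for w in windows(series)]
print(len(points_30d), len(points_30d[0]), len(points_5d[0]))  # 3 30 5
```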
Example of data (raw data of new bearings): first 1000 points
Figure: raw acceleration values for points 1 to 1000
Example of data (FFT of new bearings): first 3 coefficients of the first 100 points
Figure: coefficient 1, coefficient 2 and coefficient 3 for points 1 to 100
Example of data (statistical moments of new bearings): moments up to 3rd order of the first 100 points
Figure: 1st order, 2nd order and 3rd order moments for points 1 to 100
Ball bearing’s structure and damage
Damaged cage
Ball bearing data: results

Preprocessed with FFT

Ball bearing conditions               Total number of data points   Number of detected anomalies   Percentage detected
New bearing (normal)                  2739                          0                              0%
Outer race completely broken          2241                          2182                           97.37%
Broken cage with one loose element    2988                          577                            19.31%
Damaged cage, four loose elements     2988                          337                            11.28%
No evident damage; badly worn         2988                          209                            6.99%

Preprocessed with statistical moments

Ball bearing conditions               Total number of data points   Number of detected anomalies   Percentage detected
New bearing (normal)                  2651                          0                              0%
Outer race completely broken          2169                          1674                           77.18%
Broken cage with one loose element    2892                          14                             0.48%
Damaged cage, four loose elements     2892                          0                              0%
No evident damage; badly worn         2892                          0                              0%
Ball bearing data: performance summary

                                      Fourier transform   Statistical moments
Detection rate for the worst damage   97.37               77.18
Detection rate for all damages        37.68               21.22
False alarm rate                      3.65                0
New development of this work
A new algorithm to generate variable-sized detectors.
Purpose: reduce the possible “false negative” at the boundary of the self region.
Why the issue exists: some self samples may be very close to the boundary.
Main idea: differentiate between “internal self samples” and “boundary self samples”.
Solution: combine the advantages of the algorithms that generate variable-sized and constant-sized detectors described previously.
How much one sample tells
Samples may be on boundary
In terms of detectors
Comparing three methods
Constant-sized detectors V-detector New algorithm
Self radius = 0.05
Comparing three methods
Constant-sized detectors V-detectors New algorithm
Self radius = 0.1
Work ongoing
Estimate of coverage using formal statistics
“Point estimate” is the simplest method.
Two types of statistical inference:
1. Confidence interval
2. Hypothesis testing
Point estimate of proportion
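As an illustration of a point estimate of a proportion, the sketch below samples random points and reports the fraction covered by detectors, together with a normal-approximation 95% confidence interval (z = 1.96). It is a simplification under stated assumptions: it measures coverage of the whole unit hypercube instead of the non-self region only, and the function names and the single example detector are hypothetical.

```python
import math
import random


def covered(x, detectors):
    """True if x falls inside at least one (center, radius) detector."""
    return any(math.dist(x, c) <= r for c, r in detectors)


def coverage_estimate(detectors, n, dim, rng, z=1.96):
    """Point estimate hits/n and a normal-approximation confidence interval."""
    hits = sum(
        covered(tuple(rng.random() for _ in range(dim)), detectors)
        for _ in range(n)
    )
    p = hits / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))


rng = random.Random(42)
detectors = [((0.5, 0.5), 0.25)]  # one circle; true covered area is pi * 0.25**2
p_hat, (low, high) = coverage_estimate(detectors, n=10000, dim=2, rng=rng)
print(p_hat, low, high)
```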
Summary
1. V-detector uses fewer detectors to obtain similar coverage.
2. Smaller detectors are more acceptable if the total number of detectors is largely controlled.
3. Coverage estimate is superior to a fixed number of detectors.
4. V-detector can deal with high-dimensional data, including time series, better.
5. Self radius and estimated coverage are the two control parameters in V-detector.
6. Variable size, variable shape, variable matching rules, or other variable properties of detectors provide encouraging opportunities to enhance the negative selection mechanism.
9-17-2004