V-detector: a real-valued negative selection algorithm

Zhou Ji, St. Jude Children’s Research Hospital

What is negative selection?

Biological background: T cells, thymus

Major steps:

1. Generate candidates randomly

2. Eliminate those that recognize self samples

Main steps: generation and detection
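The two stages can be sketched in Python as follows. This is a minimal illustration rather than the implementation behind these slides: the function names, the `max_tries` guard, and the toy matching rule `toy_match` (a per-coordinate threshold) are assumptions made only for the example, and samples are assumed to lie in the unit hypercube.

```python
import random

def generate_detectors(self_samples, n_detectors, matches, dim, rng, max_tries=100000):
    """Generation stage: keep random candidates that match no self sample."""
    detectors = []
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = tuple(rng.random() for _ in range(dim))
        # Eliminate candidates that recognize any self sample.
        if not any(matches(candidate, s) for s in self_samples):
            detectors.append(candidate)
    return detectors

def is_nonself(sample, detectors, matches):
    """Detection stage: a sample is flagged when any detector matches it."""
    return any(matches(d, sample) for d in detectors)

def toy_match(a, b, threshold=0.1):
    """Hypothetical matching rule: every coordinate differs by at most threshold."""
    return max(abs(x - y) for x, y in zip(a, b)) <= threshold

rng = random.Random(0)
self_samples = [(0.50, 0.50), (0.52, 0.48), (0.47, 0.51)]
detectors = generate_detectors(self_samples, 50, toy_match, dim=2, rng=rng)

# A training sample is never flagged, because every detector that
# matched it was eliminated during generation.
print(is_nonself((0.50, 0.50), detectors, toy_match))  # False
```

The matching rule is passed in as a function so that the same two stages work with any data representation.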

What is a matching rule?

The rule that decides when a sample and a detector are considered to match.

The matching rule plays an important role in a negative selection algorithm. It largely depends on the data representation.

In real-valued representation, a detector can be visualized as a hypersphere.

Candidate 1: thrown away; candidate 2: made a detector.

Match or not match?
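For hyperspherical detectors, one common choice of matching rule is a Euclidean distance threshold. The sketch below assumes that rule and an inclusive comparison (`<=`); the helper names and the sample coordinates are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def detector_matches(center, radius, sample):
    """A hyperspherical detector matches every sample inside its radius."""
    return euclidean_distance(center, sample) <= radius

def recognizes_self(candidate, self_samples, self_radius):
    """A candidate is thrown away if it lies within self_radius of a self sample."""
    return any(euclidean_distance(candidate, s) <= self_radius for s in self_samples)

self_samples = [(0.50, 0.50), (0.55, 0.45)]
candidate_1 = (0.52, 0.50)   # close to a self sample
candidate_2 = (0.90, 0.90)   # far from all self samples

print(recognizes_self(candidate_1, self_samples, 0.1))   # True: thrown away
print(recognizes_self(candidate_2, self_samples, 0.1))   # False: made a detector
print(detector_matches(candidate_2, 0.3, (0.80, 0.80)))  # True: inside the detector
print(detector_matches(candidate_2, 0.3, (0.50, 0.50)))  # False: outside the detector
```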

Main idea of V-detector

By allowing the detectors to have some variable properties, V-detector enhances the negative selection algorithm in several respects:

It takes fewer large detectors to cover the non-self region, saving time and space.

Small detectors cover “holes” better.

Coverage is estimated when the detector set is generated.

The shapes of detectors or even the types of matching rules can be extended to be variable too.

Main concept of Negative Selection and V-detector

Constant-sized detectors Variable-sized detectors

Outline of the algorithm (generation of variable-sized detector set)

Detector Set Generation Algorithm

Constant-sized detectors

Detector-Set(S, m, r_s)

    S: set of self samples
    m: number of detectors
    r_s: self radius

    1: D ← ∅
    2: Repeat
    3:     x ← random sample from [0,1]^n
    4:     Repeat for every s_i in S = {s_i, i = 1, 2, ...}
    5:         d ← Euclidean distance between s_i and x
    6:         if d ≤ r_s, go to 2
    7:     D ← D ∪ {x}
    8: Until |D| = m
    9: return D

Variable-sized detectors

V-Detector-Set(S, T_max, r_s, c_0)

    S: set of self samples
    T_max: maximum number of detectors
    r_s: self radius
    c_0: estimated coverage

    1: D ← ∅
    2: Repeat
    3:     t ← 0
    4:     T ← 0
    5:     r ← infinite
    6:     x ← random sample from [0,1]^n
    7:     Repeat for every d_i in D = {d_i, i = 1, 2, ...}
    8:         d_d ← Euclidean distance between x(d_i) and x, where x(d_i) is the location of d_i
    9:         if d_d ≤ r(d_i) then, where r(d_i) is the radius of detector d_i
    10:            t ← t + 1
    11:            if t ≥ 1/(1 − c_0) then return D
    12:            go to 4
    13:    Repeat for every s_i in S
    14:        d ← Euclidean distance between s_i and x
    15:        if d − r_s ≤ r then r ← d − r_s
    16:    if r > 0 then D ← D ∪ {<x, r>}, where <x, r> is a detector with location x and radius r
    17:    else T ← T + 1
    18:    if T > 1/(1 − maximum self coverage) exit
    19: Until |D| = T_max
    20: return D
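The variable-sized generation procedure can be sketched in Python. This is one reading of the outline under stated assumptions, not the author's code: `t` is treated as the number of consecutive random candidates already covered by existing detectors, `T` as a running count of candidates that fall inside the self region, and `max_self_coverage` is a hypothetical parameter name for the "maximum self coverage" limit. With an estimated coverage of 0.99, generation stops after about 1/(1 − 0.99) = 100 covered candidates in a row.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def v_detector_set(self_samples, r_s, c0, t_max, rng, max_self_coverage=0.99):
    """Return a list of (center, radius) detectors with variable radii."""
    dim = len(self_samples[0])
    detectors = []
    covered_limit = 1.0 / (1.0 - c0)              # stop when coverage looks sufficient
    self_limit = 1.0 / (1.0 - max_self_coverage)  # stop when self dominates the space
    covered_in_a_row = 0   # t in the outline
    self_hits = 0          # T in the outline (kept across iterations here)
    while len(detectors) < t_max:
        x = tuple(rng.random() for _ in range(dim))
        # Candidates already covered by a detector only feed the coverage estimate.
        if any(distance(x, center) <= radius for center, radius in detectors):
            covered_in_a_row += 1
            if covered_in_a_row >= covered_limit:
                break
            continue
        covered_in_a_row = 0
        # The radius is the largest one that stays clear of every self hypersphere.
        r = min(distance(x, s) for s in self_samples) - r_s
        if r > 0:
            detectors.append((x, r))
        else:
            self_hits += 1
            if self_hits > self_limit:
                break
    return detectors

rng = random.Random(1)
# Toy self region: 20 points on a small circle around (0.5, 0.5).
self_samples = [(0.5 + 0.05 * math.cos(k), 0.5 + 0.05 * math.sin(k)) for k in range(20)]
detectors = v_detector_set(self_samples, r_s=0.05, c0=0.99, t_max=200, rng=rng)
print(len(detectors), "detectors generated")
```

Detectors far from the self samples receive large radii, while those near the self region stay small, which is what lets a few large detectors cover most of the non-self space.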

Screenshots of the software

Message view; visualization of data points and detectors

Experiments and Results

Synthetic data: 2D. Training data are randomly chosen from the normal region.

Fisher’s Iris data: one of the three types is considered “normal”.

Biomedical data: abnormal data are the medical measurements of disease-carrier patients.

Air pollution data: abnormal data are made by artificially altering the normal air measurements.

Ball bearings: time series measurements with preprocessing, 30D and 5D.

Synthetic data - Cross-shaped self space

Shape of self region and example detector coverage

(a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1

Synthetic data - Cross-shaped self space: Results

[Figure: detection rate and false alarm rate versus self radius (0.01 to 0.19), at 99% and 99.99% coverage]

[Figure: number of detectors versus self radius (0.01 to 0.19), at 99% and 99.99% coverage]

Error rates

[Figure: false negative and false positive error rates (percentage) versus self radius (0.01 to 0.19), at 99% coverage]

Synthetic data - Ring-shaped self space

Shape of self region and example detector coverage

(a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1

Synthetic data - Ring-shaped self space: Results

[Figure: detection rate and false alarm rate versus self radius (0.01 to 0.19)]

[Figure: number of detectors versus self radius (0.01 to 0.19)]


Iris data

Comparison with other methods: performance

Training data      Algorithm            Detection rate (%)   False alarm rate (%)
Setosa 100%        MILA                 95.16                0
                   NSA (single level)   100                  0
                   V-detector           99.98                0
Setosa 50%         MILA                 94.02                8.42
                   NSA (single level)   100                  11.18
                   V-detector           99.97                1.32
Versicolor 100%    MILA                 84.37                0
                   NSA (single level)   95.67                0
                   V-detector           85.95                0
Versicolor 50%     MILA                 84.46                19.6
                   NSA (single level)   96                   22.2
Virginica 100%     MILA                 75.75                0
                   NSA (single level)   92.51                0
                   V-detector           81.87                0
Virginica 50%      MILA                 88.96                24.98
                   NSA (single level)   97.18                33.26

Iris data

Comparison with other methods: number of detectors

Training data      Mean     Max   Min   SD
Setosa 100%        20       42    5     7.87
Setosa 50%         16.44    33    5     5.63
Versicolor 100%    153.24   255   72    38.8
Versicolor 50%     110.08   184   60    22.61
Virginica 100%     218.36   443   78    66.11
Virginica 50%      108.12   203   46    30.74

Iris data: Virginica as normal, 50% of points used to train

[Figure: detection rate and false alarm rate versus self radius (0.01 to 0.19)]

[Figure: number of detectors versus self radius (0.01 to 0.19)]



Biomedical data

Blood measurements for a group of 209 patients.

Each patient has four different types of measurement.

75 patients are carriers of a rare genetic disorder. Others are normal.

Biomedical data: results comparison

Training data    Algorithm   Detection rate     False alarm rate   Number of detectors
                             Mean     SD        Mean     SD        Mean     SD
100% training    MILA        59.07    3.85      0        0         1000*    0
                 NSA         69.36    2.67      0        0         1000     0
                 r = 0.1     30.61    3.04      0        0         21.52    7.29
                 r = 0.05    40.51    3.92      0        0         14.84    5.14
50% training     MILA        61.61    3.82      2.43     0.43      1000*    0
                 NSA         72.29    2.63      2.94     0.21      1000     0
                 r = 0.1     32.92    2.35      0.61     0.31      15.51    4.85
                 r = 0.05    42.89    3.83      1.07     0.49      12.28    4
25% training     MILA        80.47    2.80      14.93    2.08      1000*    0
                 NSA         86.96    2.72      19.50    2.05      1000     0
                 r = 0.1     43.68    4.25      1.24     0.5       12.24    3.97
                 r = 0.05    57.97    5.86      2.63     0.77      8.94     2.57

Biomedical data

[Figure: detection rate and false alarm rate versus self radius (0.01 to 0.19)]

[Figure: number of detectors versus self radius (0.01 to 0.19)]



Air pollution data

60 original records in total. Each consists of 16 different measurements concerning air pollution.

All the real data are considered normal.

More data are made artificially:

1. Decide the normal range of each of the 16 measurements

2. Randomly choose a real record

3. Change three randomly chosen measurements within a larger-than-normal range

4. If some of the changed measurements are out of range, the record is considered abnormal; otherwise it is considered normal

In total, 1000 records, including the original 60, are used as test data. The original 60 are used as training data.
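The four steps for making artificial records can be sketched as below. Several details are assumptions, since the slides do not specify them: the normal range of a measurement is taken as the minimum and maximum over the real records, the "larger-than-normal range" is that range widened by a hypothetical factor `widen` on each side, and the 60 "real" records here are random stand-ins rather than the actual air pollution data.

```python
import random

def normal_ranges(records):
    """Step 1: per-measurement (min, max) over the real records (assumed definition)."""
    return [(min(column), max(column)) for column in zip(*records)]

def make_record(records, ranges, rng, n_changed=3, widen=0.5):
    """Steps 2-4: alter a random real record and label the result."""
    record = list(rng.choice(records))                 # step 2
    abnormal = False
    for j in rng.sample(range(len(record)), n_changed):
        low, high = ranges[j]
        margin = widen * (high - low)
        record[j] = rng.uniform(low - margin, high + margin)   # step 3
        if not (low <= record[j] <= high):             # step 4
            abnormal = True
    return record, abnormal

rng = random.Random(42)
# Stand-in for the 60 real records with 16 measurements each.
real_records = [[rng.random() for _ in range(16)] for _ in range(60)]
ranges = normal_ranges(real_records)

# 60 original records (all normal) plus 940 artificial ones give 1000 test records.
test_set = [(list(r), False) for r in real_records]
test_set += [make_record(real_records, ranges, rng) for _ in range(940)]
print(len(test_set), sum(label for _, label in test_set), "abnormal")
```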

Air pollution data

[Figure: detection rate and false alarm rate versus self radius (0.01 to 0.19)]

[Figure: number of detectors versus self radius (0.01 to 0.19)]



Ball bearing data

Raw data: time series of acceleration measurements

Preprocessing (from time domain to representation space for detection)

1. FFT (Fast Fourier Transform) with Hanning windowing: window size 30

2. Statistical moments: up to 5th order
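Both preprocessing options can be sketched with the Python standard library. The details below are assumptions, since the slides give only the window size and the maximum order: windows do not overlap, the FFT features are the magnitudes of all 30 coefficients (computed here with a direct DFT, which yields the same coefficients as an FFT), and the moments are the mean followed by the central moments of order 2 to 5. The input series is a synthetic sine wave, not bearing data.

```python
import cmath
import math

def windows(series, size):
    """Split a series into consecutive non-overlapping windows."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]

def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def spectrum_features(frame):
    """Magnitudes of the DFT coefficients of the Hann-windowed frame."""
    n = len(frame)
    tapered = [v * w for v, w in zip(frame, hann(n))]
    return [
        abs(sum(tapered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def moment_features(frame, max_order=5):
    """Mean, then central moments of order 2..max_order."""
    n = len(frame)
    mean = sum(frame) / n
    return [mean] + [sum((v - mean) ** k for v in frame) / n
                     for k in range(2, max_order + 1)]

series = [math.sin(0.3 * i) for i in range(300)]   # synthetic stand-in signal
frames = windows(series, 30)
fft_points = [spectrum_features(f) for f in frames]     # 30 values per point
moment_points = [moment_features(f) for f in frames]    # 5 values per point
print(len(frames), len(fft_points[0]), len(moment_points[0]))  # 10 30 5
```

Under these assumptions each window becomes one point in the representation space, with 30 dimensions for the spectrum features and 5 for the moments.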

Example of data (raw data of new bearings): first 1000 points

[Figure: raw acceleration values for points 1 to 1000]

Example of data (FFT of new bearings): first 3 coefficients of the first 100 points

[Figure: coefficient 1, coefficient 2, and coefficient 3 for points 1 to 100]

Example of data (statistical moments of new bearings): moments up to 3rd order of the first 100 points

[Figure: 1st order, 2nd order, and 3rd order moments for points 1 to 100]

Ball bearing’s structure and damage

Damaged cage

Ball bearing data: results

Preprocessed with FFT

Ball bearing condition                 Total number of data points   Number of detected anomalies   Percentage detected
New bearing (normal)                   2739                          0                              0%
Outer race completely broken           2241                          2182                           97.37%
Broken cage with one loose element     2988                          577                            19.31%
Damaged cage, four loose elements      2988                          337                            11.28%
No evident damage; badly worn          2988                          209                            6.99%

Preprocessed with statistical moments

Ball bearing condition                 Total number of data points   Number of detected anomalies   Percentage detected
New bearing (normal)                   2651                          0                              0%
Outer race completely broken           2169                          1674                           77.18%
Broken cage with one loose element     2892                          14                             0.48%
Damaged cage, four loose elements      2892                          0                              0%
No evident damage; badly worn          2892                          0                              0%

Ball bearing data: performance summary

[Bar chart comparing Fourier transform and statistical moments preprocessing. Detection rate for the worst damage: Fourier transform 97.37, statistical moments 77.18. Detection rate for all damages: statistical moments 21.22. False alarm rate: 0.]

New development of this work

A new algorithm to generate variable-sized detectors.

Purpose: reduce the possible “false negatives” at the boundary of the self region.

Why the issue exists: some self samples may be very close to the boundary.

Main idea: differentiate between “internal self samples” and “boundary self samples”.

Solution: combine the advantages of the algorithms to generate variable-sized and constant-sized detectors described previously.

How much one sample tells

Samples may be on boundary

In terms of detectors

Comparing three methods (self radius = 0.05)

[Figure panels: constant-sized detectors, V-detector, new algorithm]

Comparing three methods (self radius = 0.1)

[Figure panels: constant-sized detectors, V-detector, new algorithm]

Work ongoing

Estimate of coverage using formal statistics

“Point estimate” is the simplest method.

Two types of statistical inference:

1. Confidence interval

2. Hypothesis testing

Point estimate of proportion
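A point estimate of a proportion is the fraction of sampled points that have the property of interest, here "covered by some detector". The sketch below adds a normal-approximation (Wald) confidence interval as one possible choice of interval; the slides do not say which interval is intended. The single toy detector and the use of the whole unit square as the sampled space are assumptions for illustration only.

```python
import math
import random

def point_estimate(n_covered, n_samples):
    """Point estimate of a proportion: successes divided by trials."""
    return n_covered / n_samples

def wald_interval(n_covered, n_samples, z=1.96):
    """Normal-approximation interval; z = 1.96 gives roughly 95% confidence."""
    p = n_covered / n_samples
    half_width = z * math.sqrt(p * (1 - p) / n_samples)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def count_covered(detectors, n_samples, dim, rng):
    """Count random points that fall inside at least one (center, radius) detector."""
    covered = 0
    for _ in range(n_samples):
        x = tuple(rng.random() for _ in range(dim))
        if any(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) <= r
               for c, r in detectors):
            covered += 1
    return covered

rng = random.Random(7)
toy_detectors = [((0.5, 0.5), 0.3)]          # hypothetical single detector
n = 2000
k = count_covered(toy_detectors, n, dim=2, rng=rng)
p_hat = point_estimate(k, n)
low, high = wald_interval(k, n)
print(p_hat, (low, high))
```

The disc of radius 0.3 occupies about 0.283 of the unit square, so the estimate should land near that value, with the interval narrowing as the number of sampled points grows.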

Summary

1. V-detector uses fewer detectors to obtain similar coverage.

2. Smaller detectors are more acceptable if the total number of detectors is largely controlled.

3. Coverage estimate is superior to a fixed number of detectors.

4. V-detector can deal better with high-dimensional data, including time series.

5. Self radius and estimated coverage are the two control parameters in V-detector.

6. Variable size, variable shape, variable matching rules, or other variable properties of detectors provide an encouraging opportunity to enhance the negative selection mechanism.

9-17-2004
