nasraoui icdm03 immune

8/13/2019 Nasraoui ICDM03 Immune

1/8


2/8

achieve scalability is to reduce the size of the repertoire of B-Cells,while still maintaining a high quality approximation of the antigenspace (data). High quality is achieved with a repertoire of B-cellsthat is at the same time, diverse or specialized, accurate, and com-plete.

Recently, data mining has put even higher demands on cluster-ing algorithms. They now must handle very large data sets, leadingto some scalable clustering techniques. For example, BIRCH [16]and the scalable K-Means (SKM) [3] assume that clusters are clean

of noise, hyper-spherical, similar in size, and span the whole dataspace. Robust clustering techniques have recently been proposedto handle noisy data. Another limitation of most clustering algo-rithms is that they assume that the number of clusters is known.However, in practice, the number of clusters may not be known.This problem is called unsupervised clustering . A recent explosionof applications generating and analyzing data streams has addednew unprecedented challenges for clustering algorithms if they areto be able to track changing clusters in noisy data streams usingonly the new data points because storing past data is not even anoption [2, 1, 4, 7].

In this paper, we propose a new AIS learning approach forclustering noisy multi-dimensional stream data, called TECNO-STREAMS ( Tracking E volving C lusters in NO isy Streams), thataddresses the shortcomings of current AIS models. Our approachexhibits improved learning abilities, while also achieving scalabil-ity, robustness, and automatic scale estimation .

The rest of the paper is organized as follows. In Section 2,we present a new dynamic AIS model based on a robust and dy-namic D-W-B-Cell model and learning algorithm designed to ad-dress the challenges of streamdata mining. In Section 3, we presenta new approach to tackle the immune network complexity and theTECNO-STREAMS algorithm. In Section 4, we formally provethe robustness propertiesof the newly proposed immune cell modelwith respect to outliers and the ability to evolvewith changing data.In Section 5, we compare our approach to some existing scalableclustering algorithms. In Section 6, we illustrate using the proposedDynamic AIS model for robust cluster detection and for miningWeb clickstream data. Finally, in Section 7, we present our conclu-sions.

2. A Dynamic Weighted B-Cell Model based on Ro-bust Weights: The D-W-B-Cell Model

In a dynamic environment, the antigens from a data streamare presented to the immune network one at a time, with the stim-ulation and scale measures re-updated with each presentation. Itis more convenient to think of the antigen index, , as monotoni-cally increasing with time. That is, the antigens are presented inthe following chronological order: . The DynamicWeighted B-Cell ( D-W-B-cell ) represents an inuence zone overthe domain of discourse consisting of the training data set. How-ever, since data is dynamic in nature, and has a temporal aspect,data that is more current will have higher inuence compared todata that is less current. Quantitatively, the inuence zone is de-ned in terms of a weight function that decreases not only with dis-tance from the antigen/data location to the D-W-B-cell prototype,but also with the time since the antigen has been presented to theimmune network. It is convenient to think of time as an additionaldimension that is added to the D-W-B-Cell compared to the clas-

sical B-Cell, traditionally statically dened in antigen space only

[12].Denition 1: (Robust Weight/Activation Function) For theD-W-B-cell, , , we dene the activationcaused by the antigen data point as

(1)

where controls the time decay rate of the contribution from oldantigens, and hence how much emphasis is placed on the currency

of the immune network compared to the sequence of antigens en-countered so far. is the distance from antigen (which is the

antigen encountered by the immune network) to D-W-B-cell,. is a scale parameter that controls the decay rate of the

weights along the spatial dimensions, and hence denes the size of an inuence zone around a cluster prototype. Data samples fallingfar from this zone are considered outliers. The weight functions de-crease exponentially with the order of presentation of an antigen, and therefore, will favor more current data in the learning process.Denition 2: (Inuence Zone) The D-W-B-cell represents asoft inuence zone, , that can be interpreted as a robust zone of inuence.

(2)

Each D-W-B-cell is allowed to have is own zone of inuencewith radial size proportional to , that is dynamically estimated.Hence, outliers are easily detected as data points falling outside theinuence zone of all D-W-B-cells or through their weak activations( ).Denition 3: (Pure Stimulation) The stimulation level, afterantigens have been presented to DWB , is dened as the density of the antigen population around DWB :

(3)

Lemma 1: (Optimal Scale Update) The equations for optimalscale updates are given by

(4)

Idea of the Proof: Since the time dependency has been absorbedinto the weight function, and assuming that the previous values,

, are constant, the equations for scale updates are found by

solving .

For the purpose of computational efciency, however, we con-vert the above equations to incremental counterparts as follows.Lemma 2: (Incremental Update of Pure Stimulation and Opti-mal Scale) After antigens have been presented to , purestimulation and optimal scale can be updated using the followingapproximate incremental equations, respectively,

(5)

(6)

where is the sum of the contributions from

previous antigens, , to D-W-B-Cell .


3/8

Idea of the Proof: Each term that takes part in the computation of the above measures (Equations (3) and (4) above) is updated indi-vidually with the arrival of each new antigen using the old values of each term in the numerator and each term in the denominator, andadding the contribution of the new antigen/data item to each one of these terms.

2.1 Dynamic Stimulation and Suppression

We propose incorporating a dynamic stimulation factor, ,in the computation of the D-W-B-cell stimulation level by adding acompensation term that depends on other D-W-B-cells in the net-work [10, 15]. In other words, a group of co-stimulated D-W-B-cells can self-sustain themselves in the immune network, even afterthe antigen that caused their creation disappears from the environ-ment. However, we need to put a limit on the time span of thismemory to forget truly outdated patterns. This is done by allowingD-W-B-cells to have their own stimulation coefcient, and to havethis stimulation coefcient decrease with their age: .

In the absence of a recent antigen that succeeds in stimulating agiven subnet, the age ( ) of the D-W-B-cell increases by 1 with

each antigen presented to the immune system. However, if a newantigen succeeds in stimulating a given cell, then its age is refreshedback to zero. This makes extremely old D-W-B-cells die gradually,if not re-stimulated by more recent relevent antigens.

Incorporating a dynamic suppression factor, , in the com-putation of the D-W-B-cell stimulation level is also a more sensibleway to take into account internal interactions. The suppression fac-tor is not related to memory management, but rather as a way tocontrol the proliferation and redundancy of the D-W-B-cell popu-lation. This adaptive way to control the amount of redundancy toachieve the right balance between the needed memory and the use-less redundancy can be achieved by allowing D-W-B-cells to havetheir own suppression coefcient, and using an annealing schedule

similar to stimulation, i.e.,In order to understand the combined effect of the proposed stim-

ulation and suppression mechanism, we consider the following twoextreme cases: ( i) Positive suppression (competition), but no stim-ulation results in good population control and avoids redundancy.However, there is no memory, and the immune network will for-get past encounters. ( ii) When there is positive stimulation, butno suppression, there is good memory but no competition. Thiswill cause the proliferation of the D-W-B-cell population or max-imum redundancy. Hence, there is a natural tradeoff between re-dundancy/memory and competition/reduced costs. Hence the totalstimulation for should take into account dynamic interac-tive co-stimulation and co-suppression from other B-cells, in addi-tion to antigen (pure) stimulation:

(7)

3. Bridging the Scalability Gap: Organization andCompression of the Immune Network

A divide and conquer strategy can have signicant impact on thenumber of interactions that need to be processed in the immune net-work. We dene external interactions as those occuring between an

antigen (externalagent) and the D-W-B-cell in the immune network

(giving rise to the rst term in (7)). We dene internal interactionsas those occuring between one D-W-B-cell and all other D-W-B-cells in the immune network(giving rise to the second and thirdterms in (7)). Note that the number of possible internal interactionscan be a serious bottleneck in the face of all existing immune net-work based learning techniques [10, 15]. Suppose that the immunenetwork is compressed by clustering the D-W-B-cells using a linearcomplexity approach such as K Means. Then the immune network can be divided into subnetworks that form a parsimonious view

of the entire network. For global low resolution interactions, suchas the ones between D-W-B-cells that are very different, only theinter-subnetwork interactions are germane. For higher resolutioninteractions such as the ones between similar D-W-B-cells, we candrill down inside the corresponding subnetwork and afford to con-sider all the intra-subnetwork interactions .

Lemma 3: (Effect of Network Compression on Scalability) Theproposed AIS based clustering model can achieve scalability at anite compression rate.

Proof: Since the AIS model is updated incrementally for each datasample, there is no need to store the data samples. What needs tobe stored and manipulated is the immune network itself. Hence, the

most expensive computation and storage overhead stems from cal-culating and storing all the internal network interactions (quadraticcomplexity with respect to the network size). However, if the net-work is divided into roughly equal sized subnetworks, then thenumber of internal interactions in an immune network of D-W-B-cells, can drop from

in the uncompressed network, tointra-subnetwork interactions and inter-subnetwork inter-actions in the compressed immune network. This clearly can ap-proach linear complexity as . Similarly the number of external interactions relative to each antigen can drop from inthe uncompressed network to in the compressed network. Fur-thermore, the compression rate can be modulated by choosing theappropriate numberof clusters, , when clustering the D-

W-B-cell population, to maintain linear complexity, . Sufcient summary statistics for each subnetwork of D-W-B-cells arecomputed, and can later be used as approximations in lieu of repeat-ing the computation of the entire suppression/stimulation sum. Thesummary statistics are in the form of average dissimilarity withinthe group, cardinality of the group (number of D-W-B-cells in thegroup), and density of the group. This approach can be seen asforming sub-networks of the immune network with sufcient sum-mary statistics to be used in evolving the entire immune network.

3.1 Effect of the Network Compression on InteractionTerms

Instead of taking into account all possible interactionsbetween all cells in the immune network, only the intra-subnetwork interactions with the

D-W-B-cells inside the par-ent subnetwork (the closest subnetwork to which this B cell is as-signed) are taken into account. In case K-Means is used, this rep-resentative as well as the organization of the network into subnet-works is a by-product. For more complex data structures, a rea-sonable best representative/prototype (such as a medoid) can bechosen. Taking these modications into account, the stimulationand scale values that take advantage of the compressed network are

given by


4/8

(8)

where is the pure antigen stimulation after encounteringantigens, given by (5 ) for D-W-B-cell ; and

is the number of B-cells in the subnetwork that is closest to the DWB-cell. Thiswill modify the D-W-B-cell scale update equations to become

(9)

where and

3.2 Cloning in the Dynamic Immune System

The D-W-B-cells are cloned in proportion to their stimulationlevels relative to the average network stimulation. However, toavoid preliminary proliferation of D-W-B-Cells, and to encouragea diverse repertoire, new D-W-B-Cells do not clone before they

are mature (their age, exceeds a lower limit ). Similarly,D-W-B-cells with age are frozen, or prevented fromcloning, to give a fair chance to newer D-W-B-Cells. This meansthat When

the B-cell population size ( ) exceeds a prespecied maximum( ), the B-cells are sorted in ascending order of their stimu-lation levels, and the top ( ) B-cells (with lowest stim-ulation) are killed. New B-cells ( ) are compensated tobe able to compete with more mature cells in the immune network by temporarily (for the purpose of sorting) scaling their stimulationlevel to the networks average stimulation level.

3.3 Learning New Antigens and Relation to Outlier De-tection

Somatic hypermutation is a powerfull natural exploration mech-anism in the immune system, that allows it to learn how to respondto new antigens that have never been seen before. However, froma computational point of view, this is a very costly and inefcientoperation since its complexity is exponential in the number of fea-tures. Therefore, we model this operation in the articial immunesystem model by an instant antigen duplication whenever an anti-gen is encountered that fails to activate the entire immune network.A new antigen, is said to activate the B-Cell, if it falls withinits Inuence Zone, , essentially meaning that its activation of

this D-W-B-Cell, exceeds a minimum threshold .Denition 4: (Potential Outlier) A Potential outlier is an antigenthat fails to activate the entire immune network, i.e.,

To further take advantage of the compressed immune network,only the

D-W-B-cells within the closest subnetwork (say thesubnetwork) to thecurrent ( ) antigen are considered. This

results in activation iff.

Theoutlier is termed potential because, initially, it may either be anoutlier or a new emerging pattern. It is only through the continuouslearning process that lies ahead, that the fate of this outlier will bedecided. If it is indeed a true outlier, then it will form no mature

DWB-Cells in the immune network.

(a) (b)

Figure 1. D-W-B-cells function (showing inuence of outliers): (a) , (b)

3.4 TECNO-STREAMS: Tracking Evolving Clusters inNoisy Data Streams with a Scalable Immune SystemLearning Model

TECNO-STREAMS Algorithm:(optional steps are enclosed in [] )Fix the maximal population size, ; Initialize D-W-B-cell population and using the rst

input antigens;Compress immune network into subnets using 2 iterations of K Means; Repeat for each incoming antigen

Present antigen to each subnet centroid, innetwork : Compute distance, activation weight, and updateincrementally using (6);

Determine the most activated subnet (the one with maximum ); IF All B-cells in most activated subnet have (antigen

does not sufciently activate subnet) THEN

Create by duplication a new D-W-B-cell = and ;

ELSE Repeat for each D-W-B-cell in most activated subnet

IF (antigen activates D-W-B-cell ) THEN Refresh age ( ) for D-W-B-cell ;

ELSE Increment age ( ) for D-W-B-cell ;

Compute distance from antigen to D-W-B-cell ;Compute D-W-B-cell s stimulation level using (8);Update D-W-B-cell s using (9);

Clone and mutate D-W-B-cells;

IF population size Then IF (Age of B-cell ) THEN

Temporarily scale D-W-B-cells stimulation level to the network average stimulation;

Sort D-W-B-cells in ascending order of their stimulation level;Kill worst excess (top ( ) according to previous

sorting) D-W-B-cells;[or move oldest/mature D-W-B-Cells to secondary (long term)

storage];

Compress immune network periodically (after every antigens),into subnets using 2 iterations of K Means with the previous cen-troids as initial centroids;


5/8


6/8

the clusters is shown in Figure 3. This scenario is the most difcult(worst) case for single-pass learning, as it truly tests the ability of the system to memorize the old patterns, adapt to new patterns, andstill avoid excessive proliferation.

Similar experiments are shown for a noisy data set of ve clus-ters in Figure 4 and Figure 5. These results illustrate the abilityof the immune network to track the clusters. Since this is an un-supervised clustering problem, it is not important that a cluster ismodeled by one or several D-W-B-cells. The effect of the com-

pression of the immune network if illustrated by showing the nalD-W-B-cell population for different compression rates correspond-ing to on the data set with 5 clusters, in Fig. 6. Linearcomplexity is reached for the case of . The antigens werepresented in the most challenging order (one cluster at a time), andin a single pass. Notice that depending on the order of presenta-tion of the data, some of the noise presented at the very end of therun may not be labeled as outliers with full certainty. It is tem-porarily treated like a possible emerging pattern until future datacan resolve its fate with certainty. Other than the natural uncer-tainty arising from last minute noise points, the quality of the nalimmune network or cluster model is practically insensitive to theorder of presentation of the data and to the compression rate, and it

shows an ability to model clusters with arbitrary shape since multi-ple B-cells can represent a single cluster. Finally Fig. 7 shows withdifferent shades the core points and outliers as easily recognized byour approach (outliers have negligible activations to the entire nalimmune network, i.e., ). Notice that in the absenceof knowledge about future data, noise presented at the very end of the run may be temporarily treated as a possible emerging pattern.This explains why a few noise points that happen to be presentedlast are shown in black at the end of the run. Naturally, becauseof the different order of presentation of the samples (rst to lastcluster, last to rst cluster, and random), these last undecided noisepoints may differ from one image to another in Fig. 7 (a), (b), and(c), respectively. The spaces that are void of data contain the rstpoints (unplotted) used to initialize the immune network.

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

(a) (b) (c)

Figure 2. Single Pass Results on a Noisy dataset presented one at a time in random order: Location of D-W-B-cells and estimated scales for data set with 3 clusters after processing (a) 100 samples, (b) 700 samples, and (c) all 1133samples

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

(a) (b) (c)

Figure 3. Single Pass Results on a Noisydataset presented oneat a time in thesame order as clusters: Locationof D-W-B-cells and estimated scales for data set wit h 3 clusters after processing (a) 100 samples, (b) 300 samples, and (c)all 1133 samples

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8

(a) (b) (c) (d)

Figure 4. Single Pass Results on a noisy dataset presented one at a ti me in random order with :Location of D-W-B-cells and estimated scales after processing (a) 875 samples, (b) 1750 samples, and (c) 2625 samples,(d) all 3200 samples

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8

(a) (b) (c) (d)

Figure 5. Single Pass Results on a noisy dataset presented one at a time in the same order as clusters with: Location of D-W-B-cells and estimated scales for data set after processing (a) 350 samples, (b) 1225 samples,

and (c) 1925 samples, (d) all 3200 samples

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

(a) (b) (c)

Figure 6. Effect of Compression rate on Immune Network: Location of D-W-B-cells and estimated scales fornoisy data set presented in random order(a) , (b) , (c) (linear complexity)

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Core Points Outlier Points

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1


0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1


(a) (b) (c)

Figure 7. Ability to distinguish between core and outlier points ( ) for noisy data setpresented in the followingorder: ( a) cluster 1 to cluster 5 (b) cluster 5 to cluster 1 (c) random order

6.2 Single-Pass Mining of User Proles from Real WebClickstream Data

Proles were mined from the 12-day clickstream data (from1998) with 1704 sessions and 343 URLs from the website of thedepartment of Computer Engineering and Computer Science atthe University of Missouri. This is a benchmark data set usedin [13, 14]. The proles that were discovered using TECNO-STREAMS in a single pass are comparable to the ones previouslyobtained using a variety of less scalable techniques [13, 14]. Themaximum population size was 100, the control parameter for com-pression was varied between and 50, and periodical com-pression every sessions. The activation threshold was

, and . We illustrate the continuous learning


7/8


8/8

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

0 200 400 600 800 1000 1200 1400 1600

Figure 9. Hits per usage trend versus session number when sessions are presented in reverse order from trend19 to trend 0, ,

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

0 200 400 600 800 1000 1200 1400 1600

Figure 10. Hitsper usagetrendversus sessionnumber whenallsessionsare presented innaturalchronologicalorder: ,

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

0 200 400 600 800 1000 1200 1400 1600

Figure 11. Distribution of input sessions over usage trend versus session number when only non-noisy(

) sessions are presented in natural chronological order: Trends 5, 9, 13, 14, 15, and 19 appear to beweakier and noisier. Also trends 6 and 7 emerge late in the 12-day access log, while trend 0 weakens in the last days.

Figure 12. Number of discovered proles versus specifed compression rate K, when sessions are presentedin natural chronological order

consuming to manipulate. The proposed compression mechanismand new B-cell model make this network manageable, and contin-uous learning possible. Finally, we note that TECNO-STREAMSadheres to the requirements of clustering data streams formulatedin [2]: compactness of representation , fast incremental processingof new data points , and clear and fast identication of outliers .

8 Acknowledgments

This work is supported by National Science Foundation CA-REER Award IIS-0133948 to O. Nasraoui, and by a grant fromthe Southeastern Consortium for Electrical Engineering Education.F. Gonzalez is supported by the National University of Colombia.

References

[1] S. Babu and J. Widom. Continuous queries over data streams. InSIGMOD Record01 , pages 109120, 2001.

[2] D. Barbara. Requirements for clustering data streams. ACMSIGKDD Explorations Newsletter , 3(2):2327, 2002.

[3] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithmsto large databases. In Proceedings of the 4th international conf. onKnowledge Discovery and Data Mining (KDD98) , 1998.

[4] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In 2002 Int. Conf. on Very Large Data Bases (VLDB02) , Hong Kong, China,2002.

[5] I. Cohen. Tending Adams Garden . Springer Verlag, 2000.[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algo-

rithm for discovering clusters in large spatial databases with noise. InProceedings of the 2nd international conf. on Knowledge Discoveryand Data Mining (KDD96) , Portland Oregon, 1996.

[7] S. Guha, N. Mishra, R. Motwani, and L. OCallaghan. Clusteringdata streams. In IEEE Symposium on Foundations of Computer Sci-ence (FOCS00) , Redondo Beach, CA, 2000.

[8] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel.

Robust Statistics the Approach Based on Inuence Functions . JohnWiley & Sons, New York, 1986.

[9] A. Hinneburg and D. A. Keim. An efcient approach to clusteringin large multimedia databases with noise. In Knowledge Discoveryand Data Mining , pages 5865, 1998.

[10] J. Hunt and D. Cooke. An adaptative, distributed learning system,based on immune system. In IEEE International Conference on Sys-tems, Man and Cybernetics , pages 24942499, Los Alamitos, CA,1995.

[11] N. K. Jerne. The immune system. Scientic American , 229(1):5260, 1973.

[12] O. Nasraoui, D. Dasgupta, and F. Gonzalez. An articial immunesystem approach to robust data mining. In Genetic and EvolutionaryComputation Conference (GECCO) , pages 356363, New York, NY,

2002.[13] O. Nasraoui, H. Frigui, R. Krishnapuram, and A. Joshi. Mining

web access logs using relational competitive fuzzy clustering. In Eighth International Fuzzy Systems Association Congress , Hsinchu,Taiwan, Aug. 1999.

[14] O. Nasraoui and R. Krishnapuram. One step evolutionary mining of context sensitive associations and web navigation patterns. In SIAMconference on Data Mining , pages 531547, Arlington, VA, 2002.

[15] J. Timmis, M. Neal, and J. Hunt. An articial immune system fordata analysis. Biosystems , 55(1).

[16] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efcient dataclustering method for large databases. In ACM SIGMOD Interna-tional Conference on Management of Data , pages 103114, NewYork, NY, 1996. ACM Press.

nasraoui icdm03 immune

Documents