
ORIGINAL ARTICLE

SOMwise regression: a new clusterwise regression method

Jorge Muruzábal • Diego Vidaurre •

Julián Sánchez

Received: 1 September 2010 / Accepted: 27 January 2011 / Published online: 17 February 2011

© Springer-Verlag London Limited 2011

Abstract We present a novel neural learning architecture

for regression data analysis. It combines, at the high level,

a self-organizing map (SOM) structure, and, at the low

level, a multilayer perceptron at each unit of the SOM

structure. The goal is to build a clusterwise regression

model, that is, a model recognizing several clusters in the

data, where the dependence between predictors and

response is variable (typically within some parametric

range) from cluster to cluster. The proposed algorithm,

called SOMwise Regression, follows closely in the spirit of

the standard SOM learning algorithm and has performed

satisfactorily on various test problems.

Keywords Clusterwise regression · CWR · SOMwise · Clustering · SOM · Neural networks · SOMwiseR

1 Introduction

Clusterwise regression (CWR) is a well-known supervised

learning paradigm in many research areas. In a nutshell,

CWR assumes that (1) the data contain a (typically

unknown) number of clusters and (2) different response

models apply within each cluster. The goal is to identify

how many clusters the data are most likely to contain, as

well as to provide predictive models for each inferred

cluster as a whole. Traditionally, standard linear regression

models have been used to do this, and least squares, max-

imum likelihood and expectation-maximization (EM) ideas

have provided the basis for the various training schemes

available. For a detailed presentation of several CWR

models and their respective fitting algorithms, see [6].

A frequently observed limitation of these methods is that

clusters are usually inferred on the basis of the predictors

alone (the x-part), thus entirely disregarding the available

information about the response (y). The same issue arises in

other neural network models like the radial basis function

network models, where the weights of the hidden layer are

typically determined by prior calls to k-means or other

methods [2]. Results of tests of the predictive power of

clusters determined in this way are often poor. This has

motivated the proposal of, for example, biobjective pro-

gramming techniques [3], where the goal is to optimize a

linear combination of cluster quality (again in the predictor

space) and predictive success based on that cluster

arrangement.

The self-organizing map (SOM) [10] paradigm has also

been employed in CWR. In his book, Kohonen [10] briefly

discusses some ideas on how to design a sensible SOM for

supervised learning (or supervised SOM). Some relevant

early work is pointed out, and some application areas are

highlighted. The main motivation is the observation that

getting response information involved in the training phase

(or standard SOM) is helpful for improving classification

accuracy in some cases. In this view, first, x and y are

concatenated to yield new xs. Then, the basic SOM prin-

ciples are applied as usual over the set of extended xs

vectors. Although this idea seems to have been dormant for

a while, Xiao et al. [25] have investigated this supervised

method (which they call sSOM) in drug discovery

problems.

J. Muruzábal: Deceased

J. Muruzábal · J. Sánchez

Universidad Rey Juan Carlos, Madrid, Spain

D. Vidaurre (✉)

Computational Intelligence Group,

Universidad Politécnica de Madrid, Madrid, Spain

e-mail: [email protected]


Neural Comput & Applic (2012) 21:1229–1241

DOI 10.1007/s00521-011-0536-3

Melssen, Wehrens, and Buydens [14] criticize Xiao

et al.’s development and provide two architectures for

alternative SOM-based supervised learning. These archi-

tectures each maintain two separate (but otherwise identi-

cally structured) SOMs, say x-SOM and y-SOM, so that

x and y are only compared to units in their own SOM. By

combining the usual distances between x and the pointers

in the x-SOM, on the one hand, and y and the pointers in

the y-SOM on the other, the XYF approach gets a “fused”

overall similarity measure. This leads by maximization to

the common winning unit in both SOMs. The idea behind

the slightly more complex Bidirectional Kohonen Network

(BDK) alternative, also discussed in [14], is similar.

Some other approaches relating the SOM to classifica-

tion tasks were developed later. Villmann et al. [23] adapt

the learning scheme shown in [8], where the SOM map is

trained by gradient descending a cost function, adding a

classification accuracy term to that cost function. Herrmann

and Ultsch [7] perform semi-supervised classification on a

previously trained SOM map with one neuron per data

instance. Chtourou et al. [4] build a hybrid neural network

based on a one-dimensional SOM and a set of recurrent

neural networks. Each SOM output neuron corresponds to

a local recurrent neural network. Combining SOM net-

works and Multi-Layer Perceptron (MLP) classifiers,

Tsimboukakis and Tambouratzis [17] address the document

classification problem. The SOM paradigm is used in a

previous stage of the learning process, whereas MLPs are

later used to make the desired classification.

Recently, Tokunaga and Furukawa [16] have proposed a

related architecture called the Modular Network SOM or

mnSOM. The mnSOM model has a high-level SOM

structure and an MLP at each unit. MLP learning is dif-

ferentiated by varying the learning rate. However, the

number of clusters is assumed to be known. Furthermore,

all data-to-cluster membership information is also assumed

to be known from the outset, with the result that there is no

clustering problem at all. Data items are presented in bat-

ches (corresponding to the clusters), and a different winner

is selected for each cluster. MLP learning is also of batch

type: the whole training sample is used in each update, and

each unit learns the data from the various clusters at dif-

ferent learning rates.

The approach we propose, the SOMwise Regression

method (SOMwiseR for short), can be viewed as an

alternative, flexible idea to the CWR modeling problem.

We consider the simultaneous learning of cluster structure

and local predictive models, and we let the predictive

success of the tentative local models guide the emergence

of the clusters. As in [16], our architecture combines, at the

high level, a SOM structure, and, at the low level, an MLP

at each unit of the SOM structure. The familiar SOM

clustering ability is used to deal with the problem of an

unknown number of clusters in the data. The SOM topol-

ogy will ideally reflect the predictor-response structure of

the data, which can be of help in complex scenarios. The

finer approximation ability of the MLP is used to provide

the required predictive power at each cluster. The MLP is

definitely a natural choice because it has a convenient

parameter (learning rate) to implement the main heuristic,

but some other learning paradigms could easily be tried

too. Although we work here only on regression, this

framework may also be applied to classification by adapt-

ing the learning scheme at the low level.

Unlike mnSOM, SOMwiseR estimates the cluster

structure from data. Compared to other CWR methods,

SOMwiseR does not need any previous knowledge about

the underlying cluster structure and provides a topology

that can be of interest in many analyses. SOMwiseR has

many potential real-world applications. In a marketing

scenario [3], for instance, say we had a list of consumers

where the dependent variable is the response to a certain

advertising campaign or the amount of purchased product.

Consumers may be divided into unknown segments. Here,

SOMwiseR can be used to identify and analyze this seg-

mentation. In the complex bioinformatics domain [12], a

SOMwiseR model may provide information about the

nature of the data that other paradigms cannot capture. For

example, Srivastava et al. [15] analyze microarray data to

regress some physiological response. They assume that

there is an unknown number of latent regression models in

the data, corresponding to different clusters. Clusters and

regression weights are estimated via the EM algorithm. We

believe that SOMwiseR could be a valuable alternative to

this kind of analysis.

The remainder of the paper is organized as follows.

Section 2 sets out the notation and introduces the algo-

rithm. Section 3 gives some details about the evaluation

and interpretation of the model. Section 4 outlines the set

of both synthetic and real experiments used to test the

algorithm. Finally, in Sect. 5, we discuss some conclusions

and some final ideas.

2 SOMwise regression

2.1 Basic algorithm

We now describe a number of basic aspects of the proposed

SOMwiseR approach. Let D = {zn = (xn, yn), n = 1, ..., N} denote the available data, where xn is a vector of p real-

valued predictors and yn is a scalar response. Of course,

multivariate and categorical response data can also be

tackled with MLP units requiring only minor changes to

what follows. The data are all assumed to be i.i.d., that is,

items zn are assumed to come from the same distribution


(which could be a mixture distribution), and to be distrib-

uted independently of one another. The goal is to investigate

the number and nature of possible clusters in D, where each

cluster refers here to a different model of predictor-response

dependence.

We assume the usual 2D lattice of neurons (although the

ideas extend naturally to higher-dimensional maps) that

have for simplicity’s sake the familiar rectangular topol-

ogy. Let us denote units by their coordinates (i, j). Each

SOMwiseR neuron has an attached MLP model Mij. Let

Wij collectively denote the set of weights attached to per-

ceptron Mij. We denote the various learnable weights and

thresholds in Wij as wijk, k = 1, ..., K.

Training is based on the sequential, one-at-a-time,

online processing of the zn items. Let z(t) ∈ D denote the

selected item for training at iteration cycle (or epoch) t. All

MLPs take the given x(t) and compute their output in par-

allel using the current values of the weights. Thus, the

individual error for each neuron (i, j) is computed as

eij(t) = | Mij(Wij(t), x(t)) − y(t) |.  (1)

We can then use the eij(t) values to guide the selection of

this cycle's winning unit or best-matching unit (BMU),

denoted (i*, j*):

(i*, j*) = argminij eij(t).  (2)

Note that both x(t) and y(t) are needed to determine the

BMU (i*, j*). The BMU provides the best fit for the current

training item, so that (i*, j*) and its near neighbors will pay

more attention to the training item z(t) than units lying

further away.
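As a minimal illustration of eqs. (1)–(2), the sketch below evaluates every unit's model in parallel for the current item and selects the unit with the smallest individual error as the BMU. To keep the example short, the per-unit MLPs are replaced by stand-in linear models; all names and sizes are illustrative, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
G, p = 4, 2                              # a G x G map over p predictors
W = rng.normal(size=(G, G, p))           # one stand-in linear model per unit

def bmu(W, x, y):
    preds = W @ x                        # every unit's output for x, in parallel
    e = np.abs(preds - y)                # individual errors e_ij, as in eq. (1)
    return np.unravel_index(np.argmin(e), e.shape)   # (i*, j*), as in eq. (2)

x, y = np.array([1.0, -0.5]), 0.3
i_star, j_star = bmu(W, x, y)
```

Note that, unlike the standard SOM, the winner depends on both x and y: it is the unit whose *prediction* is closest to the response, not the unit whose pointer is closest to the input.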

The standard MLP learning rate parameter will be used

to instantiate the general topological SOM learning tenet:

“learn more if you are close to the winner, learn less (or

nothing at all) if you are far”. Let ηij(t) ≥ 0 denote the

learning rate used by each MLP when learning the current

data item. Then, ηij(t) will have a maximum at (i*, j*) and

will gradually decrease as the distance to the BMU (in map

space) grows. Each MLP is trained in parallel using the

corresponding rate ηij(t). Exactly the same rate is used for all

parameters in the same Wij(t). To compute ηij(t), we use the

familiar Gaussian decay function

ηij(t) = α(t) exp(−d(i, j; i*, j*)/(2σ(t))),  (3)

where d(i, j; i*, j*) reflects map distance to the current

BMU, α(t) is the maximum learning rate at iteration t

(reserved for BMUs only), and σ(t) is the radius parameter

controlling the peakedness of the Gaussian at iteration t. α(t)

and σ(t) decrease linearly as usual.
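The decay of eq. (3) can be sketched as a small grid computation; the constants below are illustrative values, not the paper's settings.

```python
import numpy as np

def eta_grid(G, bmu, alpha_t, sigma_t):
    # learning rate per unit: maximal at the BMU, decaying with map distance
    ii, jj = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
    d = np.hypot(ii - bmu[0], jj - bmu[1])           # map-space distance to BMU
    return alpha_t * np.exp(-d / (2.0 * sigma_t))    # eta_ij(t) of eq. (3)

eta = eta_grid(10, (4, 7), alpha_t=0.2, sigma_t=2.5)
```

The BMU receives the full rate α(t) and distant units receive nearly nothing, which is exactly the "learn more near the winner" tenet.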

To promote a stable map, we can consider a batch

version by normalizing the neighborhood function:

ηij(t) = α(t) exp(−d(i, j; i*, j*)/(2σ(t))) / Σn=1..N exp(−d(i, j; in*, jn*)/(2σ(t))),  (4)

where (in*, jn*) represents the winning neuron for data item

zn. Note that batch normalization requires the full training

data set to be available from scratch.
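A sketch of the batch-normalized rate of eq. (4): each unit's neighborhood response to the current BMU is divided by its total response to the BMUs of all training items. The map size, BMU list, and constants are illustrative.

```python
import numpy as np

def eta_batch(G, bmu_t, all_bmus, alpha_t, sigma_t):
    # bmu_t: BMU of the current item; all_bmus: BMUs of the whole training set
    ii, jj = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
    def kernel(b):
        # neighborhood response of every unit toward BMU b
        return np.exp(-np.hypot(ii - b[0], jj - b[1]) / (2.0 * sigma_t))
    denom = sum(kernel(b) for b in all_bmus)   # sum over n = 1, ..., N
    return alpha_t * kernel(bmu_t) / denom     # eq. (4)

eta = eta_batch(6, (2, 2), [(2, 2), (5, 1), (0, 4)], alpha_t=0.1, sigma_t=2.0)
```

Since the current BMU also appears in the denominator sum, no rate can exceed α(t), and units that win many items see their effective rates damped, which stabilizes the map.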

MLPs are updated by the backpropagation algorithm

Δwijk(t) = wijk(t+1) − wijk(t) = −ηij(t) ∂eij(t)²/∂wijk,  k = 1, ..., K.  (5)

Following this procedure, the currently best-matching

units for the training item are adapted, so that the same

behavior will be observed again in the future. Conversely,

units that are not a good match for the current data item

will basically be left alone. The idea is to give the units

some means of selecting the particular data that they will

use to effectively learn their MLP weights.

The pseudocode in Algorithm 1 roughly outlines the

basic online method.

2.2 Refinements

To some extent, the above training scheme is very similar

to that of the standard SOM. We present here some mod-

ifications that greatly improve the algorithm performance.

Specifically, we deal with the selection of the winning

neuron and how to minimize the risk of MLPs getting

trapped in local minima. We also discuss the impact of the

SOMwiseR parameters choice.

Traditionally, the BMU is selected as the closest neuron

to the input, that is, the best match is measured directly in

terms of information at each unit separately. Here, we have

Algorithm 1 SOMwise Regression

Input: training data set D with p variables and N cases

Input: initial maximum learning rate α(0)

Input: initial radius σ(0)

Input: number of iterations T

Output: SOMwiseR map

Generate random weights wijk(1)

for t = 1 to T do

  Select a random data item z(t) ∈ D

  Compute eij(t) at each unit for z(t)

  Find the BMU (i*, j*) = argminij eij(t)

  α(t) := α(0) (1 − t/T)

  σ(t) := 1 + (σ(0) − 1) (T − t)/T

  ηij(t) := α(t) exp(−d(i, j; i*, j*)/(2σ(t))), for all neurons

  wijk(t+1) := wijk(t) − ηij(t) ∂eij(t)²/∂wijk, for all neurons, k = 1, ..., K

end for
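The online loop of Algorithm 1 can be sketched as a short runnable program. Following the simplification the paper itself mentions in Sect. 2.2 (single perceptrons, so that units correspond to linear functions), each unit carries a linear model rather than a full MLP; the map uses plain grid distance rather than the toroidal distance discussed later, and all sizes and constants are illustrative.

```python
import numpy as np

def train_somwiser(D, G=4, alpha0=0.1, sigma0=3.0, T=3000, seed=0):
    rng = np.random.default_rng(seed)
    p = D[0][0].shape[0]
    W = rng.normal(scale=0.1, size=(G, G, p + 1))   # linear weights + bias per unit
    ii, jj = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
    for t in range(T):
        x, y = D[rng.integers(len(D))]              # random data item z(t)
        xb = np.append(x, 1.0)                      # append bias input
        preds = W @ xb                              # all units answer in parallel
        e = np.abs(preds - y)                       # individual errors, eq. (1)
        bi, bj = np.unravel_index(np.argmin(e), e.shape)   # BMU, eq. (2)
        alpha = alpha0 * (1.0 - t / T)              # linear decay of alpha(t)
        sigma = 1.0 + (sigma0 - 1.0) * (T - t) / T  # linear decay of sigma(t)
        d = np.hypot(ii - bi, jj - bj)              # grid distance (for brevity)
        eta = alpha * np.exp(-d / (2.0 * sigma))    # learning rates, eq. (3)
        # gradient step on e^2 = (pred - y)^2, each unit at its own rate, eq. (5)
        W -= (eta * 2.0 * (preds - y))[..., None] * xb
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
D = [(x, float(x @ np.array([1.0, -2.0]))) for x in X]   # a single linear "cluster"
W = train_somwiser(D)
```

With a single underlying cluster, the whole map converges toward one regression function; with several clusters, distinct map regions would specialize on distinct functions.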


found that the exact replica of this idea, namely the lowest

eij, is insufficient for our purposes. It turns out that even if

the appropriate unit predicts the target well, another

unit may get closer by pure chance. This phenomenon is

more frequent for large maps and in the early phases, when

randomness in unit distributions is high. This not only

deprives the MLP in question (and its neighbors) of some

helpful learning, but it also potentially corrupts the learning

processes at and around the opportunistic MLP. As a result,

low individual error looks to be too noisy a criterion for

BMU selection.

Instead, we consider a more robust error measure that

includes a (small) neighborhood of (i, j), induced by a

small radius σerror:

Eij(t) = Σi′,j′ ei′j′(t) exp(−d(i, j; i′, j′)/(2σerror)).  (6)

Now, as soon as the SOMwiseR map begins to build the

relevant clusters, the ‘‘true’’ BMU will be surrounded by

units with close predictions. This will tend to keep its Eij

low. On the other hand, any opportunistic unit will be

typically surrounded by foreign units with a larger

predictive variance, which will raise the Eij summation.

Using the minimization of (6) to select the BMU appears

to substantially reduce the number of harmful distractions.

A similar issue has also been discussed for the traditional

SOM [20].
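The effect of the smoothed criterion of eq. (6) can be demonstrated on a toy error surface. The sketch below uses a toroidal map distance (as the paper adopts for its maps) so that the total neighborhood weight is identical for every unit and the comparison is fair; the error values are invented for illustration.

```python
import numpy as np

def smoothed_errors(e, sigma_err=1.0):
    # E_ij of eq. (6): pool every unit's error, weighted by a kernel of the
    # toroidal map distance to (i, j)
    G = e.shape[0]
    idx = np.arange(G)
    E = np.empty_like(e)
    for i in range(G):
        for j in range(G):
            di = np.minimum(np.abs(idx - i), G - np.abs(idx - i))
            dj = np.minimum(np.abs(idx - j), G - np.abs(idx - j))
            d = np.hypot(di[:, None], dj[None, :])
            E[i, j] = np.sum(e * np.exp(-d / (2.0 * sigma_err)))
    return E

# a 5 x 5 error surface: a coherent low-error block around (2, 2) and one
# "opportunistic" unit at (0, 4) that happens to score the lowest raw error
e = np.ones((5, 5))
e[1:4, 1:4] = 0.1
e[2, 2] = 0.05
e[0, 4] = 0.0
E = smoothed_errors(e)
raw = np.unravel_index(np.argmin(e), e.shape)      # the lucky isolated unit
smooth = np.unravel_index(np.argmin(E), E.shape)   # a unit in the coherent block
```

The raw criterion of eq. (1) would crown the isolated lucky unit, whereas the pooled criterion favors the unit surrounded by similarly accurate neighbors.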

A related issue refers to the familiar border effect: the

most relevant units in each cluster tend to lie on the edges

of the map. This is because not all the neurons have the

same number of neighbors. In the early stages, especially, a

bad MLP could severely decrease the average fitness of the

neighborhood, making edge neurons more biased toward

winning the data item. A smaller neighborhood has a lower

probability of including a bad neuron. Here, we use a

toroidal map to address the border effect, so that the

neighborhood of all neurons will have a common size.
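A toroidal map distance is simple to implement: each axis wraps around, so opposite edges of the grid are adjacent and every neuron has a neighborhood of identical size. A minimal sketch (function name is illustrative):

```python
import numpy as np

def torus_distance(a, b, G):
    # per-axis wrap-around distance on a G x G torus, combined Euclideanly
    di = min(abs(a[0] - b[0]), G - abs(a[0] - b[0]))
    dj = min(abs(a[1] - b[1]), G - abs(a[1] - b[1]))
    return float(np.hypot(di, dj))

d_corner = torus_distance((0, 0), (9, 9), G=10)   # opposite corners touch
d_mid = torus_distance((0, 0), (5, 0), G=10)      # farthest point along one axis
```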

Besides, the susceptibility of an MLP model to get

trapped in local minima seriously threatens the continuity

of the SOMwiseR map. To deal with this, we can employ

models that alleviate the problem at the MLP level [9]. In

this paper, we try to avoid the problem at the higher SOM

level. Tokunaga and Furukawa [16] state an equivalent

problem and give a method to minimize this risk. We will

include a similar approach in our algorithm. Let us define

the energy function

φij(t) = Σn=1..N eij(t)(n) exp(−d(i, j; i*(n), j*(n))/(2σ(t))),  (7)

where (i*(n), j*(n)) is the BMU for the n-th data item at

iteration t and eij(t)(n) is the error of neuron (i, j) for this data

item.

Each neuron (i, j) is supposed to minimize φij(t)

throughout the training process. We may heuristically

identify a neuron trapped in a local minimum when

substituting its MLP by that of one of its neighbors

decreases the energy function at this unit. The procedure

in [16], which we follow here, is to periodically check the

map looking for all the neurons meeting the condition

φij(t)′ < ε · φij(t),  (8)

where ε is a constant (typically around 0.8) and φij(t)′ is the

energy function if we substitute the MLP in (i, j) by a copy

of the MLP of some neighbor. In this case, the MLP in

(i, j) is effectively replaced by a copy of the neighbor’s

MLP.

Note that this procedure is only feasible when the

complete data set is fully available at any time during

training. For large data sets, however, the evaluation of the

energy function (7) for all neurons becomes expensive. In

this case, the check should be applied only sparingly during

training.
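The rescue step of eqs. (7)–(8) can be sketched as follows, reading condition (8) as: replace a unit's model whenever a neighbor's copy would cut its energy below ε times the current value. Per-unit models are again stand-in linear maps, and the map distance inside the energy kernel is kept planar for brevity; all of this is illustrative rather than the paper's exact implementation.

```python
import numpy as np

def rescue(W, X, y, bmus, sigma, eps=0.8):
    # W: (G, G, p) per-unit linear models; X: (N, p); y: (N,)
    # bmus: (N, 2) array holding the BMU of each data item
    G = W.shape[0]
    bi, bj = bmus[:, 0], bmus[:, 1]
    def phi(i, j, w):
        # energy of eq. (7) for model w sitting at unit (i, j)
        e = np.abs(X @ w - y)
        k = np.exp(-np.hypot(i - bi, j - bj) / (2.0 * sigma))
        return float(np.sum(e * k))
    W = W.copy()
    for i in range(G):
        for j in range(G):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = (i + di) % G, (j + dj) % G      # toroidal neighbors
                if phi(i, j, W[ni, nj]) < eps * phi(i, j, W[i, j]):  # eq. (8)
                    W[i, j] = W[ni, nj]                  # copy the better model
                    break
    return W

# toy check: three well-fit units and one trapped unit on a 2 x 2 map
X = np.linspace(1.0, 2.0, 10)[:, None]
y = X[:, 0]
W0 = np.ones((2, 2, 1))
W0[1, 1, 0] = 5.0                        # badly fit "trapped" unit
bmus = np.zeros((10, 2), dtype=int)      # all items won by unit (0, 0)
W1 = rescue(W0, X, y, bmus, sigma=1.0)
```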

Finally, the MLP structure (the number of MLP hidden

layers and neurons) is the most sensitive configuration

setting and has to be carefully chosen. Note that, when the

data dimension is high, data points get sparser and it may

happen that the input distribution is different for each

cluster. Then, a single MLP can probably represent dif-

ferent clusters, that is, different input–output functions. In

this case, it is important to reduce the models’ power. For

example, we can use single perceptrons so that units cor-

respond to linear functions. In this case, the model is

related to Kohonen’s Operator Map [10]. Unlike SOMw-

iseR (and like mnSOM), the Operator Map was designed to

process sequences of data or batches, where each sequence

is known to belong to the same cluster. Inspecting how

many clusters are deployed by the model for various levels

of unit complexity can give useful insight into the data

and the respective input–output functions.

On the other hand, selection of the α(0) and σ(0) parameters

is very much as in the traditional SOM.

3 Analysis

We now deal with how to interpret and validate a trained

SOMwiseR model. We also discuss its predictive potential.

For interpretation purposes, we propose a two-step

procedure. We first try to both identify the number of

clusters and clarify the map units membership of clusters.

This step is clearly related to the familiar step in SOM-

based analysis. Then, we look at the predominant predic-

tive MLP models encoded within each cluster. This will

entail a preselection of some units as the most trustworthy


predictors. Overall, this step should not be much more

complex than interpreting a single standard MLP model.

To ascertain the number of clusters present in the map,

we perform a final pass on D (using (6)) and obtain the

dataload matrix K = (kij), where kij equals the number of

data items zn won by map unit (i, j). We focus on the units

with the largest kij as the ‘‘epicenters’’ of the clusters in the

map. We also look for borders between clusters made up of

units with low kij. Ideally, each cluster exhibiting a dif-

ferent behavior for the response should translate into a

single neuron or, most likely, a few neurons close together

in an area of the map. Many intermediate neurons should

contain mixed-up, garbage MLPs and win no data at all,

meaning that the final picture should be easy to interpret.
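Computing the dataload matrix K is a simple counting pass. In the sketch below the per-unit models are stand-in linear maps, and, for brevity, items are assigned with the raw error of eq. (1) rather than the smoothed criterion (6) that the paper uses for this pass; the toy data are invented.

```python
import numpy as np

def dataload(W, X, y):
    # final pass over the data: count how many items each unit wins
    G = W.shape[0]
    K = np.zeros((G, G), dtype=int)
    for x_n, y_n in zip(X, y):
        e = np.abs(W @ x_n - y_n)                # individual errors, eq. (1)
        i, j = np.unravel_index(np.argmin(e), e.shape)
        K[i, j] += 1
    return K

# two "clusters": y = +x won by unit (0, 0), y = -x won by unit (2, 2)
W = np.full((3, 3, 1), 10.0)
W[0, 0, 0], W[2, 2, 0] = 1.0, -1.0
X = np.ones((40, 1))
y = np.where(np.arange(40) % 2 == 0, 1.0, -1.0)
K = dataload(W, X, y)
```

Units with large counts act as cluster "epicenters", while zero-count units mark the garbage borders between clusters.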

Although crucial for SOMwiseR postprocessing, K does

not usually play a major role in the analysis of the standard

SOM, yet it is particularly useful for judging when the

SOM reaches the equiprobabilistic state [20]. Note that

SOMwiseR does not pursue the equiprobabilistic state.

Inside the tentative clusters, we care most about those

units with the largest kij. Each of these units has provided

the model with the largest scope of application and should

be examined first. They constitute the best explanation for

the response over the subset of data that really belong to

that cluster. Complementary to K, we can also consider the

average prediction error matrix (using (1)) over the set of

items won by the unit. Note that the average error is

analogous to the mean quantization error in the standard

SOM. Units with the lowest average error would appear to

entail the most precise MLPs and therefore should also be

checked out.

Summing up, a valid postprocessing method could be as

follows. First, we obtain K and the average prediction error

matrix. Then, we keep those groups of contiguous neurons

that together win a significant share of the data items

(for example, 5% of the total data items). We can also

consider those groups of neurons whose average prediction

error is under a given threshold. Finally, we select the most

significant MLP for each cluster. This is, for example, the

MLP with the largest kij within the cluster.

The quality of the model should be verified before all

analyses. A key aspect of a standard SOM is the preser-

vation of the topological order [22]. In a nutshell, a well-

organized SOM model is expected to project close data

onto the same or nearby neurons, assuring that the map is

consistent with regard to the data set distribution. The

topographic product (TP) [1] and the directional product

(DP) [21] are two methods for quantitatively evaluating the

topological order preservation. Both are constructed on the

basis that close SOM weights (in the input data space)

should correspond to close neurons (in the map grid space).

A map with perfect topological order has a zero TP

coefficient, whereas the DP coefficient equals 1.0. Worse

maps will correspond to higher absolute values of TP and

lower DP values (rarely under 0.9 though).

To apply measures of this kind to a SOMwiseR model, a

distance function between MLPs should be defined to

replace the Euclidean distance between the traditional

SOM weights. Here, we propose instead to represent each

MLP Mij by a vector mij(t) formed by the responses that Mij

produces from some data at iteration cycle t:

mij(t) = Mij(Wij(t), R),  (9)

where R is a set of vectors in the predictor space. R can be

either a subset of the original training data set or a

randomly generated set of p-element vectors. Similar MLPs

are expected to produce an analogous vector of responses

from the same input data. Now, any of the above measures

can be applied over the response vectors mij(t).
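The representation of eq. (9) is straightforward to compute; similar models produce nearby response vectors on the common probe set. The sketch again uses stand-in linear models and invented sizes.

```python
import numpy as np

def response_vectors(W, R):
    # eq. (9): evaluate every unit's model on the common probe set R
    # W: (G, G, p) stand-in linear models; R: (m, p) probe inputs
    return np.einsum("ijp,mp->ijm", W, R)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3, 2))
W[0, 0] = W[1, 1]                        # make two units identical models
R = rng.normal(size=(5, 2))              # probe vectors in predictor space
M = response_vectors(W, R)               # m_ij vectors, one per unit
```

Euclidean distances between the mij vectors can then be fed to TP or DP exactly as the distances between ordinary SOM pointers would be.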

A final issue (not related to CWR structure detection) is

whether we can use the trained SOMwiseR model to make

point predictions. Specifically, given a new xN+1 vector,

the question is which MLP or MLPs should be used for

prediction of the associated response. Note that individual

predictions of this sort are not straightforward, for we

cannot project xN+1 directly onto the map. Indeed, the

clustering of zn data items in the SOMwiseR map disre-

gards the location of the xn vectors in the x-space: as long

as the associated response follows the same model, it is

irrelevant how shattered (in the x-space) the cluster may

become.

Although the quality of point prediction is not the reason

why one would want to use the SOMwiseR approach, there

are some sensible approaches to the issue of prediction. For

example, we can examine the j nearest neighbors of the

input xN+1, where j depends on the amount of data and the

cardinality of the cluster. If all these zn data project onto

the same unit or neighboring units, we expect the corre-

sponding MLPs to provide helpful predictions for xN+1 too.

This is because, although we have stressed that the clusters

can arbitrarily expand in x-space, some basic continuity

could certainly be expected. Otherwise, if the j selected

neighbors project onto distant units on the map, then we

must abstain from predicting, as the selected input vector

belongs to an ambiguous area where several clusters may

overlap in the x-space.
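The predict-or-abstain heuristic can be sketched as follows. The function name, the spread threshold, and the stand-in linear unit models are all illustrative assumptions; the paper only outlines the idea, not a specific rule.

```python
import numpy as np

def predict_or_abstain(x_new, X, bmus, W, j=5, max_spread=1.5):
    # X: (N, p) training inputs; bmus: (N, 2) their winning units;
    # W: (G, G, p) stand-in linear unit models
    idx = np.argsort(np.linalg.norm(X - x_new, axis=1))[:j]  # j nearest in x-space
    units = bmus[idx].astype(float)
    spread = np.linalg.norm(units - units.mean(axis=0), axis=1).max()
    if spread > max_spread:              # neighbors scatter over the map: abstain
        return None
    i, k = bmus[idx[0]]                  # unit won by the closest neighbor
    return float(W[i, k] @ x_new)        # its model's prediction

# toy setup: two x-space groups mapped to two distant units
X = np.concatenate([np.arange(10) * 0.1, 5.0 + np.arange(10) * 0.1])[:, None]
bmus = np.array([[0, 0]] * 10 + [[3, 3]] * 10)
W = np.zeros((4, 4, 1))
W[0, 0, 0] = 2.0
p_ok = predict_or_abstain(np.array([0.05]), X, bmus, W)
p_ambiguous = predict_or_abstain(np.array([2.95]), X, bmus, W)
```

A query deep inside one group is answered by that group's unit; a query midway between the groups draws neighbors from both, so the function abstains.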

4 Experiments

In this section, we consider two artificial scenarios for

comparing the proposed algorithm to two state-of-the-art

algorithms, XYF and BDK [14]. Topological order


preservation is evaluated by the DP and TP measures. Also,

we compare SOMwiseR to some CWR methods that do not

preserve topology over the real Electricity data set [13].

4.1 Synthetic data

The first scenario is a family of data sets suitable for testing

different abilities in the CWR setting. Sampling is based on

a mixture distribution, say f(x, y), reflecting some type of

clusters in the (x, y)-space. In this case, we consider two

clusters and two predictors. The data sets contain 15,000

items.

All mixture components in the (x, y)-space are sampled

according to f(y|x)f(x). The two f(x) densities associated

with the clusters are Gaussians N2(0, 4·I) and N2(μ, 4·I),

respectively, for some 0 < μ ≤ 10. We consider the case of

no overlap (μ = 10) as well as the overlap case (μ < 10).

The response within clusters is expressed as y = φ(v′x) + ξ,

where ξ follows a N(0, s²) distribution with s ≥ 0

and φ(v′x) = 1/(1 + exp(−v′x)). We thus consider the

noise case (called output noise) when s > 0. Key vectors v0

and vμ are chosen, so that the random variates φ(v0′x) and

φ(vμ′x) have the same mean and variance. The sigmoid

link / is chosen to ensure a simpler evaluation of the

learned MLP models. We also address the background

noise scenario, where f(x) is uniform over the square

(−10, 20) × (−10, 20), entirely covering the effective

support of the two Gaussians in all cases, and the noise

response y is distributed across the unit interval (regardless

of x). If we denote the sampling proportions as π0, πμ and

πb = 1 − (π0 + πμ), then we can consider the cases πb > 0

and πb = 0. We also consider the balanced case π0 = πμ as

well as the rare case where one of the clusters is more

present than the other one [24]. Table 1 shows the various

cases that we have analyzed. At this stage, we aim to gain a

basic understanding of the method’s sensitivity by ana-

lyzing one possible source of difficulty at a time.

For this family of data sets, we use 10 × 10 square

maps. Each MLP has one hidden layer and two hidden

units. All MLPs are randomly initialized by

independently sampling each MLP weight wijk

from a uniform distribution. The hidden units have sigmoid

activation, and the output unit is linear. Mimicking the

well-known heuristic for the ordinary SOM, we carry out

training in two phases, of 10,000 and 100,000 cycles,

respectively. In the first phase, the initial radius σ(0) is set to

5 and α(0) = 0.2. In the second phase, σ(0) is shrunk to 2.5

and α(0) = 0.1.

Figure 1 shows an average quality run for the Basic data

set. Figure 1a shows the dataload matrix K, where darker

means higher. The two clusters emerge perfectly. Since this

is a toroidal map, the four corners of the maps are con-

tiguous. Figure 1b depicts the average error matrix (using

(1)). Using a map-based scaling, darker means lower and

white means that the neuron has not won any data item.

There are several good models in each cluster’s realm.

Figure 1c, d examine the response maps. Here, we select, for each cluster, a training item zn at random. We then plot the individual error (1) committed at each unit of the SOMwiseR structure for zn. Darker means lower, as in the case of the error matrix. Two clearly differentiated response patterns can be discerned, in conformity with the two identified clusters. As expected, we observe a smooth growth of the error as we move away from the focus of the cluster.

Figure 2a shows a typical map generated by the XYF algorithm. This map is also toroidal. The quality of the BDK maps, not shown here, is very similar. In this case, the maps approximate the equiprobabilistic state. Figure 2b, c show the dataload matrices for each cluster. As shown, although one of the clusters wraps the other one, the map is perfectly divided into two parts.

Table 1 Training data sets

Basic              l = 10    p0 = pl    pb = 0    s = 0
Overlap            l < 10
Rare-cluster       p0 > pl
Background-noise   pb > 0
Output-noise       s > 0

Fig. 1 Gray matrices output by SOMwiseR for the Basic data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix

Neural Comput & Applic (2012) 21:1229–1241

Table 2 shows the TP and DP coefficients over 10 executions. Note that, whereas we may directly apply DP and TP to the XYF and BDK maps, we first need to compute vectors mij from the SOMwiseR model by (9). Therefore, comparisons of topological order preservation between SOMwiseR and the other methods are only approximate. For this data set, the DP coefficient is better for SOMwiseR than for the other algorithms, but the TP coefficient is worse. TP is more general purpose than DP, whereas DP is more appropriate for SOM-like maps [21].

We now switch to the Overlap data set (more specifically, l = 4, the heavy-overlap case). Figure 3 shows a typical SOMwiseR execution. Again, the two clusters can be clearly distinguished in our runs. Interestingly, the presence of heavy overlap between the clusters in the x-space does not seem to pose a problem for the approach. Figure 4 depicts a usual BDK map. The quality of XYF maps and BDK maps is very similar. In general, compared to the Basic data set, since both XYF and BDK partly rely on the x-space distribution of the data, maps generated by these methods are worse at separating the two clusters. However, there are no large differences in the DP value of the three algorithms (see Table 2). Therefore, in this case, it appears that the SOMwiseR maps represent the data more faithfully than XYF and BDK, while also preserving the (data-independent) topological order. Note that TP and DP again disagree when comparing SOMwiseR with XYF and BDK.

Figure 5 presents the results for the Rare-cluster data set (p0 = 80%, pl = 20%). We now observe that the split suggested by the dataload image still works. Note that the response maps are also very much differentiated. Figure 6 shows an XYF map where the big cluster wraps the smaller one. Table 2 indicates that the XYF and BDK maps have similar DP and TP values to SOMwiseR.

We also see that SOMwiseR appears to be sensitive to the presence of noise, be this inherent to the response or to the background. When background noise is added, cluster units are sometimes perturbed by the noisy data. This results in blurrier dataload matrices.

Figure 7 shows a SOMwiseR run for the pb = 30% case. The clusters are still differentiated. The error matrix can help to find the cluster epicenters. Figure 8 shows a BDK map. The map is scattered but separates the two clusters to some extent.

When some output noise is added to the data, the clusters become more widespread. This is hardly surprising for a system that relies heavily on prediction quality. Nevertheless, as shown in Fig. 9 (s = 0.09), the two clusters are neatly separated even when the cluster areas are wider. Figure 10 depicts an average XYF map. Although the separation of the map is neat, the border between clusters is twisted. Table 2 shows that, when either background or output noise is present, the topological order is more robustly preserved for SOMwiseR maps than for XYF and BDK maps.

Fig. 2 Gray matrices output by XYF for the Basic data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster

Table 2 DP and TP measures of the maps trained by SOMwiseR, XYF and BDK, for all the data sets

                      SOMwiseR          XYF               BDK
Basic        TP   0.11 (±0.04)      0.08 (±0.006)     0.085 (±0.005)
             DP   0.945 (±0.008)    0.928 (±0.01)     0.929 (±0.008)
l = 6        TP   0.133 (±0.02)     0.032 (±0.002)    0.033 (±0.001)
             DP   0.953 (±0.008)    0.948 (±0.002)    0.95 (±0.005)
l = 4        TP   0.105 (±0.01)     0.024 (±0.002)    0.025 (±0.002)
             DP   0.959 (±0.005)    0.958 (±0.002)    0.952 (±0.004)
pl = 20%     TP   0.072 (±0.02)     0.062 (±0.003)    0.073 (±0.004)
             DP   0.951 (±0.009)    0.956 (±0.003)    0.939 (±0.009)
pl = 10%     TP   0.098 (±0.03)     0.065 (±0.006)    0.071 (±0.002)
             DP   0.945 (±0.003)    0.950 (±0.006)    0.940 (±0.006)
pb = 20%     TP   0.019 (±0.008)    0.048 (±0.004)    0.055 (±0.005)
             DP   0.980 (±0.004)    0.939 (±0.01)     0.927 (±0.02)
pb = 30%     TP   0.018 (±0.004)    0.044 (±0.002)    0.052 (±0.002)
             DP   0.984 (±0.004)    0.945 (±0.003)    0.938 (±0.009)
s = 0.05     TP   0.026 (±0.01)     0.08 (±0.008)     0.082 (±0.002)
             DP   0.98 (±0.005)     0.935 (±0.01)     0.936 (±0.005)
s = 0.09     TP   0.043 (±0.02)     0.079 (±0.007)    0.089 (±0.006)
             DP   0.972 (±0.007)    0.948 (±0.01)     0.933 (±0.008)
4CF          TP   0.088 (±0.006)    0.092 (±0.006)    0.085 (±0.001)
             DP   0.977 (±0.002)    0.933 (±0.01)     0.945 (±0.003)
Electricity  TP   0.061 (±0.006)    0.062 (±0.001)    0.063 (±0.005)
             DP   0.934 (±0.02)     0.947 (±0.005)    0.935 (±0.008)

The best value for each data set and type of measure is highlighted


Fig. 3 Gray matrices output by SOMwiseR for the Overlap (l = 4) data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix

Fig. 4 Gray matrices output by BDK for the Overlap (l = 4) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster

Fig. 5 Gray matrices output by SOMwiseR for the Rare-cluster (p0 = 80%, pl = 20%) data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix

Fig. 6 Gray matrices output by XYF for the Rare-cluster (p0 = 80%, pl = 20%) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster

Fig. 7 Gray matrices output by SOMwiseR for the Background-noise (pb = 30%) data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix


The next data set (which we will call 4CF) contains just one predictor and four different clusters, each corresponding to a cubic function (see Fig. 11a). The 4CF data set contains 4,000 items (1,000 for each cluster). The predictor is sampled uniformly from the interval [-2, 2].
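A data set with this structure can be generated in a few lines. The particular cubic coefficients below are illustrative guesses; the paper specifies the four functions only through Fig. 11a.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four cubic response functions, one per cluster (illustrative coefficients).
cubics = [
    lambda x: x**3,
    lambda x: -x**3,
    lambda x: x**3 - 3 * x,
    lambda x: -x**3 + 3 * x,
]

def make_4cf(per_cluster=1000):
    """Build a 4CF-style data set: one predictor, four cubic clusters,
    with x sampled uniformly from [-2, 2] as in the paper."""
    xs, ys, labels = [], [], []
    for c, f in enumerate(cubics):
        x = rng.uniform(-2.0, 2.0, size=per_cluster)
        xs.append(x)
        ys.append(f(x))
        labels.append(np.full(per_cluster, c))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(labels)

X, y, c = make_4cf()   # 4,000 items, 1,000 per cluster
```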

To deal with the 4CF data set, we use larger 15 × 15 square maps, where all MLPs have one hidden layer and eight hidden units. In this case, we need a larger map size so that the four clusters have room to appear. We again carry out two training phases, of 10,000 and 100,000 cycles, respectively. The initial radius is 7.0 in the first phase and 3.5 in the second. The initial learning rate is 0.2 and then 0.1.

Fig. 8 Gray matrices output by BDK for the Background-noise (pb = 30%) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster

Fig. 9 Gray matrices output by SOMwiseR for the Output-noise (s = 0.09) data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix

Fig. 10 Gray matrices output by XYF for the Output-noise (s = 0.09) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster

Fig. 11 a The four cubic functions contained in the 4CF data set. b Grid of functions generated by SOMwiseR

Figure 11b shows the map of function shapes represented by the MLP units for one execution. The four units with the highest number of won data items, representative of the four clusters, are highlighted. As shown, each representative function successfully corresponds to one of the cubic functions in Fig. 11a. Figure 12 depicts the dataload and average error matrices. The four clusters emerge clearly in the dataload matrix K. As expected, the corresponding cluster centers show low prediction errors. The response map patterns also neatly reflect the cluster structure (see Fig. 13).

Figure 14a shows a BDK map. Figure 14b, c show the dataload matrices for the first and second clusters. The third and fourth clusters are comparable and are not displayed. Note that the XYF and BDK dataload matrices display a uniform mesh, where no clusters can be discerned. This is because both the XYF and the BDK methods determine the winning unit by separately considering the similarities in the predictor x-space and the similarities in the response y-space. As shown in Fig. 11a, the densities of the four functions are the same on the X-axis (uniformly sampled from [-2, 2]). The densities of the four functions are not very different on the Y-axis. There is almost no information in the x-space and the y-space (separately considered) to distinguish the clusters.

Furthermore, according to DP, the topological order is better preserved for the SOMwiseR maps (see Table 2). However, although the XYF and BDK maps are not informative, they do preserve the topological order to some extent.

From these synthetic scenarios, we conclude that SOMwiseR often discovers the cluster structure while, to some extent, preserving the topological order, although some topological disorder does tend to appear in specific settings.

4.2 Real data

Finally, we have tested the method on a real-world problem, the Electricity data set [13]. This data set includes one observation for each U.S. state (N = 50), with three independent variables (price of electricity, per capita income and price of gas) and a response variable (per capita electricity consumption). The states are known to be distributed over two segments.

Note that this is a realistic case where we do not know the true distribution of the data items in the clusters, and, since XYF and BDK pursue an equiprobabilistic state, there is no evidence in the XYF and BDK maps to infer the number of clusters and their structure. Whereas XYF and BDK are prediction-oriented algorithms, SOMwiseR is rather a method for discovering cluster structure.

We use 10 × 10 maps, with a single (linear) perceptron at each unit. Training is done in two phases (15,000 and 150,000 cycles). In the first phase, r(0) is 5 and a(0) is 0.01. In the second phase, r(0) is 2.5 and a(0) is 0.005.

Figure 15 shows the results. At first glance, the two clusters are clearly noticeable. Two well-defined areas appear in the dataload matrix, representing the two segments. Note that the dark units at the bottom of the map are neighbors of the two units situated at the top. The error matrix is helpful in this case to clarify where the epicenters of the clusters are located. For instance, the right cluster can be fairly well represented by the bottommost unit of the cluster. The response maps confirm these claims.

Although not presented here, we have also trained maps with stronger, nonlinear MLPs. Specifically, we have trained maps where the MLPs have one hidden layer with four hidden neurons, and maps where the MLPs have two hidden layers of, respectively, four and two hidden neurons. In both cases, whereas the model mean squared error is not very different, the separation between the two clusters is less clear. This may suggest that the data are better fitted by a model with two linear clusters.

Fig. 12 Dataload and error matrices for the 4CF data set

Fig. 13 Response matrices for the 4CF data set, each representing one of the four functions

The resulting matrices for the XYF and BDK algorithms are uniformly pale gray and are not shown. Regarding topological order preservation, both the TP and DP measures are equivalent for the three algorithms (see Table 2).

Table 3 shows a comparison between SOMwiseR and other CWR methods, displaying the model mean squared error and the number of discovered clusters. For SOMwiseR, the most representative unit of each cluster is chosen for evaluation. Results are averaged over 10 runs. The other CWR methods are linear regression fixed point clustering (FPC) [6] and a maximum likelihood estimator under a regression mixture model (MLRM) [5]. The latter, computed by the EM algorithm, needs the number of clusters to be specified; we have chosen the number of clusters that minimizes the Bayesian Information Criterion (BIC). In addition, Table 3 displays mean squared errors for XYF and BDK. Finally, we have also included results for ordinary least squares, an MLP with one hidden layer of four hidden neurons and an MLP with two hidden layers of, respectively, four and two hidden neurons.
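The BIC-based selection just mentioned follows the usual recipe: fit the regression mixture by EM for a range of cluster counts and keep the count with the lowest criterion value. A minimal sketch (the `fits` dictionary and helper names are hypothetical; any EM implementation supplying log-likelihoods and parameter counts would do):

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def pick_k(fits, n_obs):
    """fits: {k: (log_likelihood, n_params)} from EM runs of the regression
    mixture with k clusters; return the k minimising BIC."""
    return min(fits, key=lambda k: bic(*fits[k], n_obs))
```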

Whereas FPC always includes all the data items in one cluster, leading to a higher mean squared error, MLRM detects five clusters and is likely to overfit the data. When units are provided with linear models, SOMwiseR detects two clusters in most executions. Note that the prediction error is a bit higher than for MLRM but much lower than for FPC (which is equivalent to a usual linear regression).

As mentioned before, when units are provided with more complex models, SOMwiseR does not so clearly discriminate the two clusters and the prediction error is not significantly decreased. On the other hand, XYF and BDK

Fig. 14 Gray matrices for the 4CF data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster. Clusters 3 and 4 are not shown

Fig. 15 Gray matrices for the Electricity data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix

Table 3 Mean squared error and number of clusters for the Electricity data set

Method              Mean squared error    No. clusters
SOMwiseR (linear)   4.22 (±2.04)          2
SOMwiseR 4hn        4.37 (±0.95)          2
SOMwiseR 6hn        4.21 (±1.01)          2
FPC                 20.78 (±0.0)          1
MLRM                1.17 (±0.91)          5
XYF                 5.54 (±2.79)          –
BDK                 5.44 (±1.84)          –
OLS                 21.11 (±0.0)          –
MLP 4hn             20.88 (±0.0)          –
MLP 6hn             20.78 (±0.0)          –

CWR methods: SOMwiseR with linear units; SOMwiseR whose units have one hidden layer of four hidden neurons (SOMwiseR 4hn); SOMwiseR whose units have two hidden layers of, respectively, four and two hidden neurons (SOMwiseR 6hn); linear regression fixed point clustering (FPC); and maximum likelihood under a regression mixture model (MLRM). Non-CWR, topology-based methods: XYF and BDK. Non-CWR, non-topology-based methods: ordinary least squares regression (OLS), MLP with one hidden layer of four hidden neurons (MLP 4hn) and MLP with two hidden layers of, respectively, four and two hidden neurons (MLP 6hn). Results are averaged over 10 runs


have a low model mean squared error. This is to be expected, because all units are competing to give the best prediction for each data item. Finally, note that all the non-CWR, non-topological methods yield high mean squared errors.

Summing up, SOMwiseR has also proven to work well in a real environment and could be a valuable tool for gaining insight from data that other algorithms are unable to analyze. SOMwiseR appears to be able to distinguish real data sets, where more than one model is present, from simpler data sets, where a single model can represent all the data items.

5 Discussion

We have introduced a new learning architecture for the CWR modeling problem. The novel SOMwiseR approach seems to neatly translate the unsupervised learning principles of the standard SOM algorithm into the supervised learning scenario. Like the methods in [14, 16], SOMwiseR can be considered a fusion between supervised and unsupervised classification. We have shown that SOMwiseR can learn to differentiate the functional clusters in various synthetic and real cases of practical interest. So far, we have used a number of analytic tools to assess the trained SOMwiseR model. We have also revealed the system's sensitivity to noise. A software package has been developed in the R environment and is available on request.

A few further ideas have not been analyzed in this paper; we summarize them now.

To complement the dataload and error matrices, we can also rely on truly U-like matrices for cluster identification. As with the U-Matrix [19], we can meaningfully compute the average quadratic deviation between any two units' outputs over the full data set, ignoring the actual responses. Since there is some evidence that some type of functional interpolation can be achieved in our MLP context [14], the resulting U-Matrix may produce another helpful view of the target cluster structure. The P-Matrix [18], specifically designed for dealing with toroidal maps, displays an estimate of the data-space density at each neuron, much like kernel density estimation. To compute an adapted P-Matrix for SOMwiseR, the number of data items whose prediction error is under some threshold is, for each neuron, divided by the threshold itself. This threshold thus plays the role that the radius of the hypersphere whose volume is to be estimated plays in the original P-Matrix. Note that both the P-Matrix and the U-Matrix are especially suitable for maps where the number of data items is related to the number of neurons.
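The adapted P-Matrix described above reduces to one line once the per-unit prediction errors are available. A sketch, assuming the errors are stored as an array indexed by item and map position (the array layout is our assumption):

```python
import numpy as np

def p_matrix(errors, threshold):
    """Adapted P-Matrix as described in the text: for each neuron, count the
    data items whose prediction error falls below the threshold, and divide
    that count by the threshold. errors[n, i, j] is the error of unit (i, j)
    on item n."""
    return (errors < threshold).sum(axis=0) / threshold
```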

To promote the proper distribution of clusters over the map, we can introduce more powerful MLPs, with a larger number of hidden neurons, scattered in specific positions across the map. These enhanced MLPs would have greater chances of winning data and thus would typically help to build the clusters.

The SOMwiseR map can also be seen as a mere data preprocessor for simplifying the clustering task. In this case, each zn would be replaced by its response map, with the idea of displaying clusters more obviously in the response map space. Kontkanen et al. [11] have proposed a similar idea. They replace each xn vector with its predictive distribution given a central Bayesian network model. They then consider the xn vectors that lead to similar predictive distributions as being similar. Finally, these similarities are used to project data items onto a planar region for visual inspection.
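This preprocessing view is easy to prototype: represent each item by its flattened response map, i.e. the vector of prediction errors committed by every unit on that item, and feed the result to any standard clustering routine. A sketch assuming the per-unit errors are available as an array (names ours):

```python
import numpy as np

def response_map_features(errors):
    """Replace each item z_n by its flattened response map: the vector of
    prediction errors committed by every unit of the map on that item.
    errors[n, i, j] is the error of unit (i, j) on item n (layout assumed)."""
    n = errors.shape[0]
    return errors.reshape(n, -1)

# Items with similar response maps should then fall together when any
# off-the-shelf clustering method is run on these feature vectors.
```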

Finally, as already noted, some other supervised learning methods may also play the role of the MLP in the SOMwiseR approach; see also [24]. A similar idea was also introduced in [16].

Acknowledgments This research was partially supported by projects TIN2007-62626 and Cajal Blue Brain. We are very grateful to Prof. Concha Bielza and Prof. Pedro Larrañaga for their valuable support. Finally, we would like to express our very special gratitude to the memory of the first author, Prof. Jorge Muruzábal, who devised the idea and provided the inspiration for this and many other papers. It was a pleasure to work with him and share his enthusiastic attitude toward science.

References

1. Bauer H, Pawelzik K (1992) Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Trans Neural Netw 4(3):570–579
2. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
3. Brusco MJ, Cradit JD, Tashchian A (2003) Multicriterion clusterwise regression for joint segmentation: an application to customer value. J Mark Res 40(2):225–234
4. Chtourou S, Chtourou M, Hammami O (2008) A hybrid approach for training recurrent neural networks: application to multi-step-ahead prediction of noisy and large data sets. Neural Comput Appl 17(3):245–254
5. DeSarbo W, Cron W (1988) A maximum likelihood methodology for clusterwise linear regression. J Classif 5:249–282
6. Hennig C (1999) Models and methods for clusterwise linear regression. In: Gaul W, Locarek-Junge H (eds) Classification in the information age. Springer, Berlin, pp 179–187
7. Herrmann L, Ultsch A (2007) Label propagation for semi-supervised learning in self-organizing maps. In: 6th International workshop on self-organizing maps, Bielefeld, Germany
8. Heskes T (1999) Energy functions for self-organizing maps. In: Oja E, Kaski S (eds) Kohonen maps. Elsevier, Amsterdam, pp 303–316
9. Kathirvalavakumar T, Jeyaseeli Subavathi S (2009) Neighborhood based modified backpropagation algorithm using adaptive learning parameters for training feedforward neural networks. Neurocomputing 72(16–18):3915–3921
10. Kohonen T (2001) Self-organizing maps. Springer, Berlin
11. Kontkanen P, Lahtinen J, Myllymäki P, Silander T, Tirri H (2000) Supervised model-based visualization of high-dimensional data. Intell Data Anal 4(3–4):213–227
12. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano J, Armañanzas R, Santafé G, Pérez A, Robles V (2006) Machine learning in bioinformatics. Brief Bioinform 7(1):86–112
13. McCormick R (1993) Managerial economics. Prentice-Hall, Englewood Cliffs, NJ
14. Melssen W, Wehrens R, Buydens L (2006) Supervised Kohonen networks for classification problems. Chemom Intell Lab Syst 83(2):99–113
15. Srivastava S, Zhang L, Jin R, Chan C (2008) A novel method incorporating gene ontology information for unsupervised clustering and feature selection. PLoS ONE 3(12):e3860
16. Tokunaga K, Furukawa T (2009) Modular network SOM. Neural Netw 22(1):82–90
17. Tsimboukakis N, Tambouratzis G (2007) Self-organizing word map for context-based document classification. In: 6th International workshop on self-organizing maps, Bielefeld, Germany
18. Ultsch A (2003) Maps for the visualization of high-dimensional data spaces. In: Workshop on self-organizing maps, Kyushu, Japan, pp 225–230
19. Ultsch A, Siemon H (1990) Kohonen's self-organizing feature maps for exploratory data analysis. In: Proceedings of the international neural networks conference. Kluwer Academic Press, Paris, pp 305–308
20. Van Hulle MM (2000) Faithful representations and topographic maps: from distortion- to information-based self-organization. Wiley, New York
21. Vidaurre D, Muruzábal J (2007) A quick assessment of topology preservation for SOM structures. IEEE Trans Neural Netw 18(5):1524–1528
22. Villmann T, Herrmann M, Martinetz T (1997) Topology preservation in self-organizing feature maps: exact definition and measurement. IEEE Trans Neural Netw 8(2):256–266
23. Villmann T, Seiffert U, Schleif F, Bruß C, Geweniger T, Hammer B (2006) Fuzzy labeled self-organizing map with label-adjusted prototypes. LNAI, vol 4087. Springer, Ulm, Germany, pp 46–56
24. Weiss GM (2004) Mining with rarity: a unifying framework. SIGKDD Explor Newsl 6(1):7–19
25. Xiao Y, Clauset A, Harris R, Bayram E, Santago P, Schmitt J (2005) Supervised self-organizing maps in drug discovery: 1. Robust behavior with overdetermined data sets. J Chem Inf Model 45(6):1749–1758
