ORIGINAL ARTICLE
SOMwise regression: a new clusterwise regression method
Jorge Muruzabal • Diego Vidaurre •
Julian Sanchez
Received: 1 September 2010 / Accepted: 27 January 2011 / Published online: 17 February 2011
© Springer-Verlag London Limited 2011
Abstract We present a novel neural learning architecture
for regression data analysis. It combines, at the high level,
a self-organizing map (SOM) structure, and, at the low
level, a multilayer perceptron at each unit of the SOM
structure. The goal is to build a clusterwise regression
model, that is, a model recognizing several clusters in the
data, where the dependence between predictors and
response is variable (typically within some parametric
range) from cluster to cluster. The proposed algorithm,
called SOMwise Regression, follows closely in the spirit of
the standard SOM learning algorithm and has performed
satisfactorily on various test problems.
Keywords Clusterwise regression · CWR · SOMwise · Clustering · SOM · Neural networks · SOMwiseR
1 Introduction
Clusterwise regression (CWR) is a well-known supervised
learning paradigm in many research areas. In a nutshell,
CWR assumes that (1) the data contain a (typically
unknown) number of clusters and (2) different response
models apply within each cluster. The goal is to identify
how many clusters the data are most likely to contain, as
well as to provide predictive models for each inferred
cluster as a whole. Traditionally, standard linear regression
models have been used to do this, and least squares, max-
imum likelihood and expectation-maximization (EM) ideas
have provided the basis for the various training schemes
available. For a detailed presentation of several CWR
models and their respective fitting algorithms, see [6].
A frequently observed limitation of these methods is that
clusters are usually inferred on the basis of the predictors
alone (the x-part), thus entirely disregarding the available
information about the response (y). The same issue arises in
other neural network models like the radial basis function
network models, where the weights of the hidden layer are
typically determined by prior calls to k-means or other
methods [2]. Results of tests of the predictive power of
clusters determined in this way are often poor. This has
motivated the proposal of, for example, biobjective pro-
gramming techniques [3], where the goal is to optimize a
linear combination of cluster quality (again in the predictor
space) and predictive success based on that cluster
arrangement.
The self-organizing map (SOM) [10] paradigm has also
been employed in CWR. In his book, Kohonen [10] briefly
discusses some ideas on how to design a sensible SOM for
supervised learning (or supervised SOM). Some relevant
early work is pointed out, and some application areas are
highlighted. The main motivation is the observation that
getting response information involved in the training phase
(of the standard SOM) is helpful for improving classification
accuracy in some cases. In this view, first, x and y are
concatenated to yield new xs. Then, the basic SOM prin-
ciples are applied as usual over the set of extended xs
vectors. Although this idea seems to have been dormant for
a while, Xiao et al. [25] have investigated this supervised
method (which they call sSOM) in drug discovery
problems.
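The concatenation trick behind this supervised SOM can be sketched in a few lines. The following is a minimal illustration, not the exact sSOM of [25]; the grid size, decay schedule, and all names are our own choices:

```python
import numpy as np

# Minimal sketch of the supervised-SOM trick: x and y are concatenated into
# extended vectors and an ordinary online SOM is trained on them. Grid size,
# decay schedule, and all names are illustrative, not the paper's.
def train_ssom(X, y, grid=(5, 5), iters=2000, alpha0=0.2, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.hstack([X, y.reshape(-1, 1)])              # extended vectors (x, y)
    H, W = grid
    codebook = rng.normal(size=(H, W, Z.shape[1]))
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for t in range(iters):
        z = Z[rng.integers(len(Z))]                   # one random item per step
        d = ((codebook - z) ** 2).sum(axis=2)         # distance in (x, y)-space
        bi, bj = np.unravel_index(d.argmin(), d.shape)
        frac = 1.0 - t / iters
        alpha, sigma = alpha0 * frac, max(sigma0 * frac, 0.5)
        g = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * sigma ** 2))
        codebook += alpha * g[..., None] * (z - codebook)  # standard SOM rule
    return codebook
```

Because the response takes part in the winner selection, the resulting map organizes itself around the joint (x, y) distribution rather than the predictors alone.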
J. Muruzabal: Deceased
J. Muruzabal · J. Sanchez
Universidad Rey Juan Carlos, Madrid, Spain
D. Vidaurre (&)
Computational Intelligence Group,
Universidad Politecnica de Madrid, Madrid, Spain
e-mail: [email protected]
Neural Comput & Applic (2012) 21:1229–1241
DOI 10.1007/s00521-011-0536-3
Melssen, Wehrens, and Buydens [14] criticize Xiao
et al.’s development and provide two architectures for
alternative SOM-based supervised learning. These archi-
tectures each maintain two separate (but otherwise identi-
cally structured) SOMs, say x-SOM and y-SOM, so that
x and y are only compared to units in their own SOM. By
combining the usual distances between x and the pointers
in the x-SOM, on the one hand, and y and the pointers in
the y-SOM on the other, the XYF approach gets a ‘‘fused’’
overall similarity measure. This leads by maximization to
the common winning unit in both SOMs. The idea behind
the slightly more complex Bidirectional Kohonen Network
(BDK) alternative, also discussed in [14], is similar.
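The fusion step can be illustrated schematically. The sketch below combines the two per-unit dissimilarities with a weight; the fusion weight `w_x` and all names are our assumptions, not necessarily the exact XYF formulation:

```python
import numpy as np

# Schematic of the fused-similarity idea: dissimilarities to the x-SOM and
# y-SOM pointers are combined per unit, and the common winner minimizes the
# fused value. The fusion weight `w_x` and all names are illustrative.
def fused_bmu(x, y, x_codebook, y_codebook, w_x=0.5):
    dx = ((x_codebook - x) ** 2).sum(axis=-1)   # distance in the x-SOM
    dy = (y_codebook - y) ** 2                  # distance in the y-SOM (scalar y)
    fused = w_x * dx + (1.0 - w_x) * dy
    return np.unravel_index(np.argmin(fused), fused.shape)
```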
Some other approaches relating the SOM to classifica-
tion tasks were developed later. Villmann et al. [23] adapt
the learning scheme shown in [8], where the SOM map is
trained by gradient descending a cost function, adding a
classification accuracy term to that cost function. Herrmann
and Ultsch [7] perform semi-supervised classification on a
previously trained SOM map with one neuron per data
instance. Chtourou et al. [4] build a hybrid neural network
based on a one-dimensional SOM and a set of recurrent
neural networks. Each SOM output neuron corresponds to
a local recurrent neural network. Combining SOM net-
works and Multi-Layer Perceptron (MLP) classifiers, Tsi-
mboukakis and Tambouratzis [17] address the document
classification problem. The SOM paradigm is used in a
previous stage of the learning process, whereas MLPs are
later used to make the desired classification.
Recently, Tokunaga and Furukawa [16] have proposed a
related architecture called the Modular Network SOM or
mnSOM. The mnSOM model has a high-level SOM
structure and an MLP at each unit. MLP learning is dif-
ferentiated by varying the learning rate. However, the
number of clusters is assumed to be known. Furthermore,
all data-to-cluster membership information is also assumed
to be known from the outset, with the result that there is no
clustering problem at all. Data items are presented in bat-
ches (corresponding to the clusters), and a different winner
is selected for each cluster. MLP learning is also of batch
type: the whole training sample is used in each update, and
each unit learns the data from the various clusters at dif-
ferent learning rates.
The approach we propose, the SOMwise Regression
method (SOMwiseR for short), can be viewed as an
alternative, flexible approach to the CWR modeling problem.
We consider the simultaneous learning of cluster structure
and local predictive models, and we let the predictive
success of the tentative local models guide the emergence
of the clusters. As in [16], our architecture combines, at the
high level, a SOM structure, and, at the low level, an MLP
at each unit of the SOM structure. The familiar SOM
clustering ability is used to deal with the problem of an
unknown number of clusters in the data. The SOM topol-
ogy will ideally reflect the predictor-response structure of
the data, which can be of help in complex scenarios. The
finer approximation ability of the MLP is used to provide
the required predictive power at each cluster. The MLP is
definitely a natural choice because it has a convenient
parameter (learning rate) to implement the main heuristic,
but some other learning paradigms could easily be tried
too. Although we work here only on regression, this
framework may also be applied to classification by adapt-
ing the learning scheme at the low level.
Unlike mnSOM, SOMwiseR estimates the cluster
structure from data. Compared to other CWR methods,
SOMwiseR does not need any previous knowledge about
the underlying cluster structure and provides a topology
that can be of interest in many analyses. SOMwiseR has
many potential real-world applications. In a marketing
scenario [3], for instance, say we had a list of consumers
where the dependent variable is the response to a certain
advertising campaign or the amount of purchased product.
Consumers may be divided into unknown segments. Here,
SOMwiseR can be used to identify and analyze this seg-
mentation. In the complex bioinformatics domain [12], a
SOMwiseR model may provide information about the
nature of the data that other paradigms cannot capture. For
example, Srivastava et al. [15] analyze microarray data to
regress some physiological response. They assume that
there is an unknown number of latent regression models in
the data, corresponding to different clusters. Clusters and
regression weights are estimated via the EM algorithm. We
believe that SOMwiseR could be a valuable alternative to
this kind of analysis.
The remainder of the paper is organized as follows.
Section 2 sets out the notation and introduces the algo-
rithm. Section 3 gives some details about the evaluation
and interpretation of the model. Section 4 outlines the set
of both synthetic and real experiments used to test the
algorithm. Finally, in Sect. 5, we discuss some conclusions
and some final ideas.
2 SOMwise regression
2.1 Basic algorithm
We now describe a number of basic aspects of the proposed
SOMwiseR approach. Let D = {zn = (xn, yn), n = 1, …, N} denote
the available data, where xn is a vector of p real-valued
predictors and yn is a scalar response. Of course,
multivariate and categorical response data can also be
tackled with MLP units requiring only minor changes to
what follows. The data are all assumed to be i.i.d., that is,
items zn are assumed to come from the same distribution
(which could be a mixture distribution), and to be distrib-
uted independently of one another. The goal is to investigate
the number and nature of possible clusters in D, where each
cluster refers here to a different model of predictor-response
dependence.
We assume the usual 2D lattice of neurons (although the
ideas extend naturally to higher-dimensional maps) that
have for simplicity’s sake the familiar rectangular topol-
ogy. Let us denote units by their coordinates (i, j). Each
SOMwiseR neuron has an attached MLP model Mij. Let
Wij collectively denote the set of weights attached to per-
ceptron Mij. We denote the various learnable weights and
thresholds in Wij as wijk, k = 1, …, K.
Training is based on the sequential, one-at-a-time,
online processing of the zn items. Let z(t) ∈ D denote the
selected item for training at iteration cycle (or epoch) t. All
MLPs take the given x(t) and compute their output in par-
allel using the current values of the weights. Thus, the
individual error for each neuron (i, j) is computed as

eij(t) = |Mij(Wij(t), x(t)) − y(t)|.   (1)

We can then use the eij(t) values to guide the selection of
this cycle's winning unit or best-matching unit (BMU),
denoted (i*, j*):

(i*, j*) = argminij(eij(t)).   (2)
Note that both x(t) and y(t) are needed to determine the
BMU (i*, j*). The BMU provides the best fit for the current
training item, so that (i*, j*) and its near neighbors will pay
more attention to the training item z(t) than units lying
further away.
The standard MLP learning rate parameter will be used
to instantiate the general topological SOM learning tenet:
‘‘learn more if you are close to the winner, learn less (or
nothing at all) if you are far''. Let ηij(t) ≥ 0 denote the
learning rate used by each MLP when learning the current
data item. Then, ηij(t) will have a maximum at (i*, j*) and
will gradually decrease as the distance to the BMU (in map
space) grows. Each MLP is trained in parallel using the
corresponding rate ηij(t). Exactly the same rate is used for all
parameters in the same Wij(t). To compute ηij(t), we use the
familiar Gaussian decay function

ηij(t) = α(t) exp(−d(i, j, i*, j*) / (2σ(t))),   (3)

where d(i, j, i*, j*) reflects map distance to the current
BMU, α(t) is the maximum learning rate at iteration t
(reserved for BMUs only), and σ(t) is the radius parameter
controlling the peakedness of the Gaussian at iteration t.
α(t) and σ(t) decrease linearly as usual.
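As a quick illustration, the rate field in (3) can be computed for every unit at once. The sketch below assumes squared Euclidean grid distance; the paper only writes d(i, j, i*, j*), so this is our choice:

```python
import numpy as np

# Sketch of the neighborhood rate (3): every unit gets a learning rate that
# peaks at the BMU and decays with map distance. Squared Euclidean grid
# distance is an assumption here; the paper leaves d unspecified.
def learning_rates(grid, bmu, alpha_t, sigma_t):
    H, W = grid
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    d = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2
    return alpha_t * np.exp(-d / (2.0 * sigma_t))
```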
To promote a stable map, we can consider a batch
version by normalizing the neighborhood function:

ηij(t) = α(t) exp(−d(i, j, i*, j*) / (2σ(t))) / Σn=1..N exp(−d(i, j, in*, jn*) / (2σ(t))),   (4)

where (in*, jn*) represents the winning neuron for data item
zn. Note that batch normalization requires the full training
data set to be available from scratch.
MLPs are updated by the backpropagation algorithm

Δwijk(t) = wijk(t+1) − wijk(t) = −ηij(t) ∂(eij(t))² / ∂wijk,  k = 1, …, K.   (5)
Following this procedure, the currently best-matching
units for the training item are adapted, so that the same
behavior will be observed again in the future. Conversely,
units that are not a good match for the current data item
will basically be left alone. The idea is to give the units
some means of selecting the particular data that they will
use to effectively learn their MLP weights.
The pseudocode in Algorithm 1 roughly outlines the
basic online method.
2.2 Refinements
To some extent, the above training scheme is very similar
to that of the standard SOM. We present here some mod-
ifications that greatly improve the algorithm performance.
Specifically, we deal with the selection of the winning
neuron and how to minimize the risk of MLPs getting
trapped in local minima. We also discuss the impact of the
choice of the SOMwiseR parameters.
Traditionally, the BMU is selected as the closest neuron
to the input, that is, the best match is measured directly in
terms of information at each unit separately. Here, we have
Algorithm 1 SOMwise Regression
Input: training data set D with p variables and N cases
Input: initial maximum learning rate α(0)
Input: initial radius σ(0)
Input: number of iterations T
Output: SOMwiseR map
Generate random weights wijk(1)
for t = 1 to T do
  Select a random data item z(t) ∈ D
  Compute eij(t) at each unit for z(t)
  Find the BMU (i*, j*) = argminij(eij(t))
  α(t) := α(0)(1 − t/T)
  σ(t) := 1 + (σ(0) − 1)(T − t)/T
  ηij(t) = α(t) exp(−d(i, j, i*, j*) / (2σ(t))), for all neurons
  wijk(t+1) = wijk(t) − ηij(t) ∂(eij(t))² / ∂wijk, for all neurons, k = 1, …, K
end for
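The loop above can be made concrete in a short runnable sketch. For brevity, each unit here carries a single linear model instead of a full MLP (a special case the paper itself allows in Sect. 2.2); the grid size, decay constants, and squared Euclidean map distance are our illustrative choices:

```python
import numpy as np

# Runnable sketch of Algorithm 1 with single linear units in place of MLPs
# (a special case the paper allows). All parameter choices are illustrative.
def somwiser(X, y, grid=(6, 6), T=5000, alpha0=0.2, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    H, W = grid
    N, p = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])             # append a bias input
    Wgt = rng.normal(scale=0.1, size=(H, W, p + 1))  # weights w_ijk per unit
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for t in range(T):
        n = rng.integers(N)
        pred = Wgt @ Xb[n]                           # all units predict in parallel
        err = pred - y[n]                            # signed error; |err| is e_ij
        bmu = np.unravel_index(np.abs(err).argmin(), err.shape)
        alpha = alpha0 * (1 - t / T)                 # linear decay of alpha(t)
        sigma = 1 + (sigma0 - 1) * (T - t) / T       # linear decay of sigma(t)
        d = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2  # assumed map distance
        eta = alpha * np.exp(-d / (2 * sigma))       # neighborhood rates, as in (3)
        # gradient of err^2 w.r.t. the weights is 2*err*x, giving rule (5)
        Wgt -= (eta * 2 * err)[..., None] * Xb[n]
    return Wgt
```

Trained on data drawn from two different linear response models, the winning units specialize so that, for every item, some unit on the map fits it well.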
found that the exact replica of this idea, namely the lowest
eij, is insufficient for our purposes. It turns out that even if
the appropriate unit takes a good shot at the target, another
unit may get closer by pure chance. This phenomenon is
more frequent for large maps and in the early phases, when
randomness in unit distributions is high. This not only
deprives the MLP in question (and its neighbors) of some
helpful learning, but it also potentially corrupts the learning
processes at and around the opportunistic MLP. As a result,
low individual error looks to be too noisy a criterion for
BMU selection.
Instead, we consider a more robust error measure that
includes a (small) neighborhood of (i, j), induced by a
small radius σerror:

Eij(t) = Σi′,j′ ei′j′(t) exp(−d(i, j, i′, j′) / (2σerror)).   (6)
Now, as soon as the SOMwiseR map begins to build the
relevant clusters, the ‘‘true’’ BMU will be surrounded by
units with close predictions. This will tend to keep its Eij
low. On the other hand, any opportunistic unit will be
typically surrounded by foreign units with a larger
predictive variance, which will raise the Eij summation.
Using the minimization of (6) to select the BMU appears
to substantially reduce the number of harmful distractions.
A similar issue has also been discussed for the traditional
SOM [20].
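The smoothed criterion (6) is easy to state in code. The sketch below again assumes squared Euclidean map distance and, for brevity, a non-toroidal grid:

```python
import numpy as np

# Sketch of the robust BMU criterion (6): each unit's error is blended with
# its small neighborhood before the argmin is taken. Squared Euclidean grid
# distance and the non-toroidal grid are simplifying assumptions.
def smoothed_errors(e, sigma_err=1.0):
    H, W = e.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    E = np.empty_like(e)
    for i in range(H):
        for j in range(W):
            d = (ii - i) ** 2 + (jj - j) ** 2      # map distance to (i, j)
            E[i, j] = (e * np.exp(-d / (2.0 * sigma_err))).sum()
    return E
```

A unit with a luckily low individual error but poor neighbors no longer wins; the smoothed minimum moves to the coherent low-error region.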
A related issue refers to the familiar border effect: the
most relevant units in each cluster tend to lie on the edges
of the map. This is because not all the neurons have the
same number of neighbors. In the early stages, especially, a
bad MLP could severely decrease the average fitness of the
neighborhood, making edge neurons more biased toward
winning the data item. A smaller neighborhood has a lower
probability of including a bad neuron. Here, we use a
toroidal map to address the border effect, so that the
neighborhood of all neurons will have a common size.
In addition, the susceptibility of an MLP model to getting
trapped in local minima seriously threatens the continuity
of the SOMwiseR map. To deal with this, we can employ
models that alleviate the problem at the MLP level [9]. In
this paper, we try to avoid the problem at the higher SOM
level. Tokunaga and Furukawa [16] describe an equivalent
problem and give a method to minimize this risk. We
include a similar approach in our algorithm. Let us define
the energy function

φij(t) = Σn=1..N eij(t)(n) exp(−d(i, j, i*(n), j*(n)) / (2σ(t))),   (7)

where (i*(n), j*(n)) is the BMU for the n-th data item at
iteration t and eij(t)(n) is the error of neuron (i, j) for this data
item.
Each neuron (i, j) is supposed to minimize φij(t)
throughout the training process. We may heuristically
identify a neuron as affected by a local minimum when
substituting its MLP by that of one of its neighbors
decreases the energy function at this point. The procedure
in [16], which we follow here, is to periodically check the
map looking for all the neurons meeting the condition

φij(t)′ < ε · φij(t),   (8)

where ε is a constant (typically around 0.8) and φij(t)′ is the
energy function obtained if we substitute the MLP in (i, j)
by a copy of the MLP of some neighbor. In this case, the
MLP in (i, j) is effectively replaced by a copy of the
neighbor's MLP.
Note that this procedure is only feasible when the
complete data set is fully available at any time during
training. For large data sets, however, the evaluation of the
energy function (7) for all neurons becomes expensive. In
that case, the check should be applied only sparingly during
training.
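A sketch of how the energy (7) and the replacement check could be evaluated on a held batch follows; the array shapes, the squared Euclidean grid distance, and all names are our illustrative assumptions:

```python
import numpy as np

# Sketch of the energy (7): `errors` is an (H, W, N) array holding e_ij(n) for
# every unit and item, and bmu_idx[n] is the BMU of item n. Squared Euclidean
# grid distance is again an assumption; all names are illustrative.
def energy(errors, bmu_idx, sigma):
    H, W, N = errors.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    phi = np.zeros((H, W))
    for n, (bi, bj) in enumerate(bmu_idx):
        d = (ii - bi) ** 2 + (jj - bj) ** 2
        phi += errors[:, :, n] * np.exp(-d / (2.0 * sigma))
    return phi

# A unit is flagged when copying in a neighbor's model would cut its energy
# below eps times its own, mirroring the replacement rule described above.
def flag_for_replacement(phi, phi_sub, eps=0.8):
    return phi_sub < eps * phi
```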
Finally, the MLP structure (the number of MLP hidden
layers and neurons) is the most sensitive configuration
setting and has to be carefully chosen. Note that, when the
data dimension is high, data points get sparser and it may
happen that the input distribution is different for each
cluster. Then, a single MLP can probably represent dif-
ferent clusters, that is, different input–output functions. In
this case, it is important to reduce the models’ power. For
example, we can use single perceptrons so that units cor-
respond to linear functions. In this case, the model is
related to Kohonen’s Operator Map [10]. Unlike SOMw-
iseR (and like mnSOM), the Operator Map was designed to
process sequences of data or batches, where each sequence
is known to belong to the same cluster. Inspecting how
many clusters the model deploys for different degrees of
unit complexity can give useful insight about the data
and the respective input–output functions.
On the other hand, selection of the α(0) and σ(0) parameters
is very much as in the traditional SOM.
3 Analysis
We now deal with how to interpret and validate a trained
SOMwiseR model. We also discuss its predictive potential.
For interpretation purposes, we propose a two-step
procedure. We first try to both identify the number of
clusters and clarify the map units membership of clusters.
This step is clearly related to the familiar step in SOM-
based analysis. Then, we look at the predominant predic-
tive MLP models encoded within each cluster. This will
entail a preselection of some units as the most trustworthy
predictors. Overall, this step should not be much more
complex than interpreting a single standard MLP model.
To ascertain the number of clusters present in the map,
we perform a final pass on D (using (6)) and obtain the
dataload matrix K = (kij), where kij equals the number of
data items zn won by map unit (i, j). We focus on the units
with the largest kij as the ‘‘epicenters’’ of the clusters in the
map. We also look for borders between clusters made up of
units with low kij. Ideally, each cluster exhibiting a dif-
ferent behavior for the response should translate into a
single neuron or, most likely, a few neurons close together
in an area of the map. Many intermediate neurons should
contain mixed-up, garbage MLPs and win no data at all,
meaning that the final picture should be easy to interpret.
Although crucial for SOMwiseR postprocessing, K does
not usually play a major role in the analysis of the standard
SOM, yet it is particularly useful for judging when the
SOM reaches the equiprobabilistic state [20]. Note that
SOMwiseR does not pursue the equiprobabilistic state.
Inside the tentative clusters, we care most about those
units with the largest kij. Each of these units has provided
the model with the largest scope of application and should
be examined first. They constitute the best explanation for
the response over the subset of data that really belong to
that cluster. Complementary to K, we can also consider the
average prediction error matrix (using (1)) over the set of
items won by the unit. Note that the average error is
analogous to the mean quantization error in the standard
SOM. Units with the lowest average error would appear to
entail the most precise MLPs and therefore should also be
checked out.
Summing up, a valid postprocessing method could be as
follows. First, we obtain K and the average prediction error
matrix. Then, we keep those groups of contiguous neurons
that together account for a significant share of the won data
items (for example, 5% of the total data items). We can also
consider those groups of neurons whose average prediction
error is under a given threshold. Finally, we select the most
significant MLP for each cluster. This is, for example, the
MLP with the largest kij within the cluster.
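The postprocessing pass just described can be sketched as follows; the 5% threshold is the example value from the text, and all names are illustrative:

```python
import numpy as np

# Sketch of the postprocessing pass: a final sweep assigns every item to its
# BMU, giving the dataload matrix K; units winning at least a minimum share
# of the data are kept as candidate cluster epicenters. Names are ours.
def dataload_matrix(bmu_idx, grid):
    K = np.zeros(grid, dtype=int)
    for i, j in bmu_idx:
        K[i, j] += 1
    return K

def epicenters(K, min_share=0.05):
    total = K.sum()
    return [tuple(int(v) for v in u) for u in np.argwhere(K >= min_share * total)]
```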
The quality of the model should be verified before all
analyses. A key aspect of a standard SOM is the preser-
vation of the topological order [22]. In a nutshell, a well-
organized SOM model is expected to project close data
onto the same or nearby neurons, assuring that the map is
consistent with regard to the data set distribution. The
topographic product (TP) [1] and the directional product
(DP) [21] are two methods for quantitatively evaluating the
topological order preservation. Both are constructed on the
basis that close SOM weights (in the input data space)
should correspond to close neurons (in the map grid space).
A map with perfect topological order has a zero TP
coefficient, whereas the DP coefficient equals 1.0. Worse
maps will correspond to higher absolute values of TP and
lower DP values (rarely under 0.9 though).
To apply this kind of measure to a SOMwiseR model, a
distance function between MLPs should be defined to
replace the Euclidean distance between the traditional
SOM weights. Here, we propose instead to represent each
MLP Mij by a vector mij(t) formed by the responses that Mij
produces from some data at iteration cycle t:

mij(t) = Mij(Wij(t), R),   (9)
where R is a set of vectors in the predictor space. R can be
either a subset of the original training data set or a randomly
generated set of p-element vectors. Similar MLPs
are expected to produce an analogous vector of responses
from the same input data. Now, any of the above measures
can be applied over the response vectors mij(t).
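Representation (9) needs nothing more than running every unit's model on a common probe set. A sketch, with illustrative names:

```python
import numpy as np

# Sketch of (9): each unit's MLP is summarized by its responses on a fixed
# probe set R, so ordinary vector distances (and hence TP or DP) apply to
# units. `models` maps unit coordinates to prediction callables; names ours.
def response_vectors(models, R):
    return {unit: np.array([model(x) for x in R]) for unit, model in models.items()}
```

Two units implementing nearly the same input-output function then yield nearby response vectors, whatever their internal weights look like.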
A final issue (not related to CWR structure detection) is
whether we can use the trained SOMwiseR model to make
point predictions. Specifically, given a new xN+1 vector,
the question is which MLP or MLPs should be used for
prediction of the associated response. Note that individual
predictions of this sort are not straightforward, for we
cannot project xN+1 directly onto the map. Indeed, the
clustering of zn data items in the SOMwiseR map disre-
gards the location of the xn vectors in the x-space: as long
as the associated response follows the same model, it is
irrelevant how shattered (in the x-space) the cluster may
become.
Although the quality of point prediction is not the reason
why one would want to use the SOMwiseR approach, there
are some sensible approaches to the issue of prediction. For
example, we can examine the j nearest neighbors of the
input xN+1, where j depends on the amount of data and the
cardinality of the cluster. If all these zn data project onto
the same unit or neighboring units, we expect the
corresponding MLPs to provide helpful predictions for xN+1 too.
This is because, although we have stressed that the clusters
can arbitrarily expand in x-space, some basic continuity
could certainly be expected. Otherwise, if the j selected
neighbors project onto distant units on the map, then we
must abstain from predicting, as the selected input vector
belongs to an ambiguous area where several clusters may
overlap in the x-space.
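The prediction heuristic just outlined could look like the following sketch; the neighborhood size j, the spread threshold, and the plain (non-toroidal) map distance are all our illustrative choices:

```python
import numpy as np

# Sketch of the hedged prediction rule: find the j nearest training inputs to
# x_new; if their BMUs sit close together on the map, average those units'
# model outputs, otherwise abstain (return None). Names are illustrative.
def predict_or_abstain(x_new, X, bmu_idx, models, j=3, max_spread=1.5):
    d = np.linalg.norm(X - x_new, axis=1)
    nn = np.argsort(d)[:j]                       # j nearest neighbors in x-space
    units = np.array([bmu_idx[n] for n in nn], dtype=float)
    spread = np.linalg.norm(units - units.mean(axis=0), axis=1).max()
    if spread > max_spread:
        return None                              # ambiguous area: abstain
    return float(np.mean([models[tuple(map(int, u))](x_new) for u in units]))
```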
4 Experiments
In this section, we consider two artificial scenarios for
comparing the proposed algorithm to two state-of-the-art
algorithms, XYF and BDK [14]. Topological order
preservation is evaluated by the DP and TP measures. We
also compare SOMwiseR to some non-topology-preserving
CWR methods on the real Electricity data set [13].
4.1 Synthetic data
The first scenario is a family of data sets suitable for testing
different abilities in the CWR setting. Sampling is based on
a mixture distribution, say f(x, y), reflecting some type of
clusters in the (x, y)-space. In this case, we consider two
clusters and two predictors. The data sets contain 15,000
items.
All mixture components in the (x, y)-space are sampled
according to f(y|x)f(x). The two f(x) densities associated
with the clusters are Gaussians N2(0, 4·I) and N2(μ, 4·I),
respectively, for some 0 < μ ≤ 10. We consider the case of
no overlap (μ = 10) as well as the overlap case (μ < 10).
The response within clusters is expressed as y = φ(v′x) + ξ,
where ξ follows a N(0, s²) distribution with s ≥ 0
and φ(v′x) = 1/(1 + exp(−v′x)). We thus consider the
noise case (called output noise) when s > 0. Key vectors v0
and vμ are chosen so that the random variates φ(v0′x) and
φ(vμ′x) have the same mean and variance. The sigmoid
link φ is chosen to ensure a simpler evaluation of the
learned MLP models. We also address the background
noise scenario, where f(x) is uniform over the square
(−10, 20) × (−10, 20), entirely covering the effective
support of the two Gaussians in all cases, and the noise
response y is distributed across the unit interval (regardless
of x). If we denote the sampling proportions as p0, pμ and
pb = 1 − (p0 + pμ), then we can consider the cases pb > 0
and pb = 0. We also consider the balanced case p0 = pμ as
well as the rare case where one of the clusters is more
present than the other one [24]. Table 1 shows the various
cases that we have analyzed. At this stage, we aim to gain a
basic understanding of the method’s sensitivity by ana-
lyzing one possible source of difficulty at a time.
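A data set from this family can be generated as in the following sketch. The key vectors v0 and vμ and the second cluster center (taken here as (μ, μ)) are illustrative placeholders, not the paper's exact choices:

```python
import numpy as np

# Sketch of the synthetic family: two Gaussian clusters in x, a sigmoid link
# phi(v'x) per cluster, optional output noise s and background proportion pb.
# v0, vmu and the (mu, mu) cluster center are placeholders, not the paper's.
def sample(n, mu=10.0, s=0.0, pb=0.0, p0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    v0, vmu = np.array([1.0, 1.0]), np.array([-1.0, -1.0])   # placeholders
    X, y = np.empty((n, 2)), np.empty(n)
    for i in range(n):
        u = rng.random()
        if u < pb:                                   # background noise item
            X[i] = rng.uniform(-10.0, 20.0, size=2)
            y[i] = rng.random()                      # y uniform on (0, 1)
        elif u < pb + (1.0 - pb) * p0:               # cluster at the origin
            X[i] = rng.multivariate_normal(np.zeros(2), 4.0 * np.eye(2))
            y[i] = sigmoid(v0 @ X[i]) + (rng.normal(0.0, s) if s > 0 else 0.0)
        else:                                        # second cluster
            X[i] = rng.multivariate_normal(mu * np.ones(2), 4.0 * np.eye(2))
            y[i] = sigmoid(vmu @ X[i]) + (rng.normal(0.0, s) if s > 0 else 0.0)
    return X, y
```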
For this family of data sets, we use 10 × 10 square
maps. Each MLP has one hidden layer with two hidden
units. All MLPs are randomly initialized by independently
sampling each MLP weight wijk
from a uniform distribution. The hidden units have sigmoid
activation, and the output unit is linear. Mimicking the
well-known heuristic for the ordinary SOM, we carry out
training in two phases, of 10,000 and 100,000 cycles,
respectively. In the first phase, the initial radius σ(0) is set to
5 and α(0) = 0.2. In the second phase, σ(0) is shrunk to 2.5
and α(0) = 0.1.
Figure 1 shows an average quality run for the Basic data
set. Figure 1a shows the dataload matrix K, where darker
means higher. The two clusters emerge perfectly. Since this
is a toroidal map, the four corners of the maps are con-
tiguous. Figure 1b depicts the average error matrix (using
(1)). Using a map-based scaling, darker means lower and
white means that the neuron has not won any data item.
There are several good models in each cluster’s realm.
Figure 1c, d examine the response maps. Here, we select,
for each cluster, a training item zn at random. We then plot
the individual error (1) committed at each unit of the
SOMwiseR structure for zn. Darker means lower as in the
case of the error matrix. Two clearly differentiated
response patterns can be discerned in conformity with the
two identified clusters. As expected, we observe a smooth
growth of the error as we move away from the focus of the
cluster.
Figure 2a shows a typical map generated by the XYF
algorithm. This map is also toroidal. The quality of the
BDK maps, not shown here, is very similar. In this case,
the maps approximate the equiprobabilistic state. Fig-
ure 2b, c show the dataload matrices for each cluster. As
shown, although one of the clusters wraps the other one,
the map is perfectly divided into two parts.
Table 1 Training data sets

Basic              μ = 10    p0 = pμ    pb = 0    s = 0
Overlap            μ < 10
Rare-cluster       p0 > pμ
Background-noise   pb > 0
Output-noise       s > 0
Fig. 1 Gray matrices output by SOMwiseR for the Basic data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second
cluster response matrix
Table 2 shows the TP and DP coefficients over 10
executions. Note that, whereas we may directly apply DP
and TP to the XYF and BDK maps, we first need to
compute vectors mij from the SOMwiseR model by (9).
Therefore, comparisons of topological order preservation
between SOMwiseR and the other methods are only
approximate. For this data set, the DP coefficient is
better for SOMwiseR than for the other algorithms, but
the TP coefficient is worse. TP is more general purpose
than DP, whereas DP is more appropriate for SOM-like
maps [21].
We now switch to the Overlap data set (more specifically,
μ = 4, the heavy overlap case). Figure 3 shows a
typical SOMwiseR execution. Again, the two clusters can
be clearly appreciated in our runs. Interestingly, the pres-
ence of heavy overlap between the clusters in the x-space
does not seem to pose a problem for the approach. Figure 4
depicts a usual BDK map. The quality of XYF maps and
BDK maps is very similar. In general, compared to the Basic
data set, since both XYF and BDK in part rely on the x-
space distribution of the data, maps generated by these
methods are worse at separating the two clusters. However,
there are no big differences in the DP value of the three
algorithms (see Table 2). Therefore, in this case, it appears
that the SOMwiseR maps represent the data more faithfully
than XYF and BDK, while also preserving the (data-
independent) topological order. Note that TP and DP again
disagree when comparing SOMwiseR with XYF and BDK.
Figure 5 presents the results for the Rare-cluster data set
(p0 = 80%, pμ = 20%). We now observe that the split
suggested by the dataload image still works. Note that the
response maps are also very much differentiated. Figure 6
shows an XYF map where the big cluster wraps the smaller
one. Table 2 indicates that the XYF and BDK maps have
similar DP and TP values to SOMwiseR.
We also see that SOMwiseR appears to be sensitive to
the presence of noise, be this inherent to the response or the
background. When background noise is added, cluster
units are sometimes perturbed by the noisy data. This
results in blurrier dataload matrices.
Figure 7 shows a SOMwiseR run for the pb = 30%
case. The clusters are still differentiated. The error matrix
can help to find the cluster epicenters. Figure 8 shows a
BDK map. The map is scattered but separates the two
clusters to some extent.
When some output noise is added to the data, clusters
become more widespread. This is hardly surprising for a
system that heavily relies on prediction quality. Never-
theless, as shown in Fig. 9 (s = 0.09), the two clusters are
neatly separated even when the cluster areas are wider.
Figure 10 depicts an average XYF map. Although the
separation of the map is neat, the border between clusters is
twisted. Table 2 shows that, when either background or
output noise is present, the topological order is more
robustly preserved for SOMwiseR maps than for XYF and
BDK maps.
Fig. 2 Gray matrices output by XYF for the Basic data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster
Table 2 DP and TP measures of the maps trained by SOMwiseR,
XYF and BDK, for all the data sets

                  SOMwiseR          XYF               BDK
Basic        TP   0.11 (±0.04)      0.08 (±0.006)     0.085 (±0.005)
             DP   0.945 (±0.008)    0.928 (±0.01)     0.929 (±0.008)
μ = 6        TP   0.133 (±0.02)     0.032 (±0.002)    0.033 (±0.001)
             DP   0.953 (±0.008)    0.948 (±0.002)    0.95 (±0.005)
μ = 4        TP   0.105 (±0.01)     0.024 (±0.002)    0.025 (±0.002)
             DP   0.959 (±0.005)    0.958 (±0.002)    0.952 (±0.004)
pμ = 20%     TP   0.072 (±0.02)     0.062 (±0.003)    0.073 (±0.004)
             DP   0.951 (±0.009)    0.956 (±0.003)    0.939 (±0.009)
pμ = 10%     TP   0.098 (±0.03)     0.065 (±0.006)    0.071 (±0.002)
             DP   0.945 (±0.003)    0.950 (±0.006)    0.940 (±0.006)
pb = 20%     TP   0.019 (±0.008)    0.048 (±0.004)    0.055 (±0.005)
             DP   0.980 (±0.004)    0.939 (±0.01)     0.927 (±0.02)
pb = 30%     TP   0.018 (±0.004)    0.044 (±0.002)    0.052 (±0.002)
             DP   0.984 (±0.004)    0.945 (±0.003)    0.938 (±0.009)
s = 0.05     TP   0.026 (±0.01)     0.08 (±0.008)     0.082 (±0.002)
             DP   0.98 (±0.005)     0.935 (±0.01)     0.936 (±0.005)
s = 0.09     TP   0.043 (±0.02)     0.079 (±0.007)    0.089 (±0.006)
             DP   0.972 (±0.007)    0.948 (±0.01)     0.933 (±0.008)
4CF          TP   0.088 (±0.006)    0.092 (±0.006)    0.085 (±0.001)
             DP   0.977 (±0.002)    0.933 (±0.01)     0.945 (±0.003)
Electricity  TP   0.061 (±0.006)    0.062 (±0.001)    0.063 (±0.005)
             DP   0.934 (±0.02)     0.947 (±0.005)    0.935 (±0.008)

The best value for each data set and type of measure is highlighted
Neural Comput & Applic (2012) 21:1229–1241 1235
Fig. 3 Gray matrices output by SOMwiseR for the Overlap (l = 4) data set: a Dataload matrix, b Error matrix, c First cluster response matrix,
d Second cluster response matrix
Fig. 4 Gray matrices output by
BDK for the Overlap (l = 4)
data set: a Dataload matrix of
the full data set, b Dataload
matrix of the first cluster,
c Dataload matrix of the second
cluster
Fig. 5 Gray image matrices output by SOMwiseR for the Rare-cluster (p0 = 80%, pl = 20%) data set: a Dataload matrix, b Error matrix,
c First cluster response matrix, d Second cluster response matrix
Fig. 6 Gray matrices output by
XYF for the Rare-cluster (p0 = 80%, pl = 20%) data set:
a Dataload matrix of the full
data set, b Dataload matrix of
the first cluster, c Dataload
matrix of the second cluster
Fig. 7 Gray matrices output by SOMwiseR for the Background-noise (pb = 30%) data set: a Dataload matrix, b Error matrix, c First cluster
response matrix, d Second cluster response matrix
The next data set (which we will call 4CF) has just
one predictor and four different clusters, each corresponding
to a cubic function (see Fig. 11a). The 4CF data
set contains 4,000 items (1,000 for each cluster). The
predictor is sampled uniformly from the interval [-2, 2].
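A 4CF-style generator can be sketched as follows. The paper does not list the exact cubic coefficients, so the coefficient tuples below are illustrative placeholders, not the originals; only the sampling interval [-2, 2] and the cluster sizes follow the text.

```python
import numpy as np

def make_4cf(n_per_cluster=1000, seed=0):
    """Generate a 4CF-style data set: one predictor drawn uniformly
    from [-2, 2] and four clusters, each following a different cubic.
    NOTE: the coefficient tuples are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    # (a, b, c, d) for y = a*x^3 + b*x^2 + c*x + d, one tuple per cluster
    cubics = [(1.0, 0.0, -2.0, 0.0),
              (-1.0, 0.0, 2.0, 0.0),
              (0.5, 1.0, 0.0, -1.0),
              (-0.5, -1.0, 0.0, 1.0)]
    xs, ys, labels = [], [], []
    for k, (a, b, c, d) in enumerate(cubics):
        x = rng.uniform(-2.0, 2.0, n_per_cluster)
        xs.append(x)
        ys.append(a * x**3 + b * x**2 + c * x + d)
        labels.append(np.full(n_per_cluster, k))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(labels)
```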
To deal with the 4CF data set, we shall use bigger
15 × 15 square maps, where all MLPs have one hidden
layer and eight hidden units. In this case, we need a larger
map size, so that the four clusters have room to appear. We
again carry out two training phases, of 10,000 and 100,000
cycles, respectively. The initial radius is 7.0 in the first
phase and 3.5 in the second. The initial learning rate is 0.2
and then 0.1.

Fig. 8 Gray matrices output by BDK for the Background-noise (pb = 30%) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster
Fig. 9 Gray matrices output by SOMwiseR for the Output-noise (s = 0.09) data set: a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response matrix
Fig. 10 Gray matrices output by XYF for the Output-noise (s = 0.09) data set: a Dataload matrix of the full data set, b Dataload matrix of the first cluster, c Dataload matrix of the second cluster
Fig. 11 a The four cubic functions contained in the 4CF data set. b Grid of functions generated by SOMwiseR
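The two-phase annealing of radius and learning rate follows standard SOM practice; as a minimal sketch, assuming a linear decay toward zero (the paper's exact decay law is not reproduced in this section and may differ):

```python
def linear_decay(initial, n_cycles, t, floor=0.0):
    """Anneal a parameter linearly from `initial` at t = 0 down to
    `floor` at t = n_cycles, then hold it there (a common SOM
    convention; the paper's exact schedule may differ)."""
    frac = min(t / n_cycles, 1.0)
    return initial + (floor - initial) * frac

# Phase 1 of the 4CF experiment: radius starts at 7.0
r0 = linear_decay(7.0, 10_000, 0)         # 7.0 at the start of phase 1
r_mid = linear_decay(7.0, 10_000, 5_000)  # 3.5 halfway through
```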
Figure 11b shows the map of function shapes represented
by the MLP units for one execution. The four units
that win the most data items, representative of the four
clusters, are highlighted. As shown, each representative
function corresponds to one of the
cubic functions in Fig. 11a. Figure 12 depicts the dataload
and average error matrices. The four clusters emerge
clearly in the dataload matrix K. As expected, the corre-
sponding cluster centers show low prediction errors. The
response map patterns also neatly reflect the cluster struc-
ture (see Fig. 13).
Figure 14a shows a BDK map. Figures 14b and c show
the dataload matrices for the first and second clusters. The
third and fourth clusters are comparable and are not
displayed. Note that the XYF and BDK dataload matrices
display a uniform mesh, where no clusters can be appre-
ciated. This is because both the XYF and the BDK methods
determine the winning unit by separately considering the
similarities in the predictor x-space and the similarities in
the response y-space. As shown in Fig. 11a, the densities of
the four functions are the same on the X-axis (uniformly
sampled from [-2, 2]). The densities of the four functions
are not very different on the Y-axis. There is almost no
information in the x-space and the y-space (separately
considered) to distinguish the clusters.
Furthermore, according to DP, the topological order is
better preserved for the SOMwiseR maps (see Table 2).
However, although the XYF and the BDK maps are not
informative, they preserve the topological order to some
extent.
From these synthetic scenarios, we conclude that
SOMwiseR often discovers the cluster structure while
preserving the topological order to some extent, although
some topological disorder does tend to arise in specific settings.
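TP and DP are the topology-preservation measures used throughout (defined earlier in the paper). As a simpler illustration of this family of measures, Kohonen's classic topographic error — the fraction of items whose two best-matching units are not adjacent on the grid — can be computed as follows; this is a stand-in, not the paper's TP or DP.

```python
import numpy as np

def topographic_error(data, weights, grid):
    """Fraction of data items whose two best-matching units are not
    adjacent on the map grid. `weights` is (n_units, dim); `grid` is
    (n_units, 2) with each unit's row/column position."""
    errors = 0
    for x in data:
        dists = np.linalg.norm(weights - x, axis=1)
        best, second = np.argsort(dists)[:2]
        # Adjacency on a square grid: Chebyshev distance of 1
        if np.abs(grid[best] - grid[second]).max() > 1:
            errors += 1
    return errors / len(data)
```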
4.2 Real data
Finally, we have tested the method on a real-world prob-
lem, the Electricity data set [13]. This data set includes one
sampling for each U.S. state (N = 50), with three inde-
pendent variables (price of electricity, per capita income
and price of gas) and a response variable (per capita
electricity consumption). The states are known to be dis-
tributed over two segments.
Note that this is a realistic case where we do not know
the true distribution of the data items in the clusters, and,
since XYF and BDK pursue an equiprobabilistic state,
there is no evidence in the XYF and BDK maps to infer the
number of clusters and their structure. Whereas XYF and
BDK are prediction-oriented algorithms, SOMwiseR is
rather a method for discovering cluster structure.
We use 10 9 10 maps, with a single (linear) perceptron
at each unit. Training is done in two phases (15,000 and
150,000 cycles). In the first phase, r(0) is 5 and a(0) is 0.01.
In the second phase, r(0) is 2.5 and a(0) is 0.005.
Figure 15 shows the results. At first glance, the two
clusters are clearly noticeable. Two well-defined areas
appear in the dataload matrix, representing the two seg-
ments. Note that the dark units at the bottom of the map are
neighbors of the two units situated at the top. The error
matrix is helpful in this case to clarify where the epicenters
of the clusters are located. For instance, the right cluster
can be fairly well represented by the bottommost unit of the
cluster. The response maps confirm these claims.
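The dataload matrix K and the average-error matrix shown in these figures can be assembled from per-unit prediction errors. A minimal sketch follows; the helper name is ours, while the winner rule (lowest prediction error) follows the SOMwiseR description.

```python
import numpy as np

def dataload_and_error(pred_errors, grid_shape):
    """Build the dataload matrix K (items won per unit) and the average
    prediction-error matrix from a (n_items, n_units) error array.
    Winner = unit with the lowest prediction error, as in SOMwiseR."""
    rows, cols = grid_shape
    winners = pred_errors.argmin(axis=1)
    dataload = np.bincount(winners, minlength=rows * cols)
    avg_err = np.full(rows * cols, np.nan)  # NaN where a unit wins nothing
    for u in range(rows * cols):
        won = pred_errors[winners == u, u]
        if won.size:
            avg_err[u] = won.mean()
    return dataload.reshape(grid_shape), avg_err.reshape(grid_shape)
```

Rendering both arrays as grayscale images yields the dataload and error matrices discussed in the text.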
Although not presented here, we have also trained maps
with stronger, nonlinear MLPs. Specifically, we have
trained maps where the MLPs have one hidden layer and
four hidden neurons, and maps where the MLPs have two
hidden layers of, respectively, four and two hidden
neurons. In both cases, whereas the model mean squared
error is not very different, the separation between the two
clusters is less clear. This may suggest that the data are
better fitted by a model with two linear clusters.

Fig. 12 Dataload and error matrices for the 4CF data set
Fig. 13 Response matrices for the 4CF data set, each representing one of the four functions
The resulting matrices for the XYF and BDK algorithms
are pale gray plains and are not shown. Regarding topology
order preservation, both TP and DP measures are equiva-
lent for the three algorithms (see Table 2).
Table 3 shows a comparison between SOMwiseR and
other CWR methods, displaying the model mean squared
error and the number of discovered clusters. For SOMw-
iseR, the most representative unit of each cluster is chosen
for evaluation. Results are averaged over 10 runs. The
other CWR methods are linear regression fixed point
clustering (FPC) [6] and a maximum likelihood estimator
under a regression mixture model (MLRM) [5]. The latter,
computed by the EM algorithm, requires the number of
clusters to be specified; we have chosen the number of
clusters that minimizes the Bayesian Information Criterion
(BIC). In addition, Table 3 displays mean squared errors for
XYF and BDK. Finally, we have also included results for
ordinary least squares, an MLP with one hidden layer of four
hidden neurons and an MLP with two hidden layers of,
respectively, four and two hidden neurons.
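The BIC-based choice of the number of MLRM components can be sketched generically; the log-likelihood values below are toy numbers for illustration, not those of the Electricity fit.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def pick_n_clusters(fits, n_obs):
    """`fits` maps a candidate cluster count to a
    (log_likelihood, n_params) pair; return the count with lowest BIC."""
    return min(fits, key=lambda k: bic(*fits[k], n_obs))

# Toy illustration: the improved fit at k = 2 beats its parameter
# penalty, while the marginal gain at k = 3 does not
fits = {1: (-120.0, 3), 2: (-100.0, 7), 3: (-99.0, 11)}
best = pick_n_clusters(fits, n_obs=50)
```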
Whereas FPC always includes all the data items in one
cluster, leading to higher mean squared error, MLRM
detects five clusters and is likely to overfit the data. When
units are provided with linear models, SOMwiseR detects
two clusters in most executions. Note that the prediction
error is a bit higher than for MLRM but much lower than
for FPC (which is equivalent to a usual linear regression).
As mentioned before, when units are provided with more
complex models, SOMwiseR does not so clearly discrim-
inate the two clusters and the prediction error is not sig-
nificantly decreased. On the other hand, XYF and BDK
Fig. 14 Gray matrices for the
4CF data set: a Dataload matrix
of the full data set, b Dataload
matrix of the first cluster,
c Dataload matrix of the second
cluster. Clusters 3 and 4 are not
shown
Fig. 15 Gray matrices for the Electricity data set. a Dataload matrix, b Error matrix, c First cluster response matrix, d Second cluster response
matrix
Table 3 Mean squared error and number of clusters for the Electricity data set

Method             Mean squared error  No. clusters
SOMwiseR (linear)  4.22 (±2.04)        2
SOMwiseR 4hn       4.37 (±0.95)        2
SOMwiseR 6hn       4.21 (±1.01)        2
FPC                20.78 (±0.0)        1
MLRM               1.17 (±0.91)        5
XYF                5.54 (±2.79)        –
BDK                5.44 (±1.84)        –
OLS                21.11 (±0.0)        –
MLP 4hn            20.88 (±0.0)        –
MLP 6hn            20.78 (±0.0)        –
CWR methods: SOMwiseR with linear units, SOMwiseR whose units
have one hidden layer of four hidden neurons (SOMwiseR 4hn),
SOMwiseR whose units have two hidden layers of, respectively, four
and two hidden neurons (SOMwiseR 6hn), linear regression fixed
point clustering (FPC) and maximum likelihood under a regression
mixture model (MLRM). Non-CWR, topology-based methods:
XYF and BDK. Non-CWR methods not based on topology: ordinary
least squares regression (OLS), MLP with one hidden layer of four
hidden neurons (MLP 4hn) and MLP with two hidden layers of,
respectively, four and two hidden neurons (MLP 6hn). Results are
averaged over 10 runs
have a low model mean squared error. This is to be expected,
since all units are competing to give the best prediction
for each data item. Finally, note that all the non-CWR, non-
topological methods yield high mean squared errors.
Summing up, SOMwiseR has also proven to work well
in a real environment and could be a valuable tool for
gaining insight from data that other algorithms are unable
to analyze. SOMwiseR appears to be able to distinguish
real data sets, where more than one model is present, from
simpler data sets, where a single model can represent all
the data items.
5 Discussion
We have introduced a new learning architecture for the
CWR modeling problem. The novel SOMwiseR approach
seems to neatly translate the unsupervised learning prin-
ciples of the standard SOM algorithm into the supervised
learning scenario. Like the methods in [14, 16], SOMwiseR
can be considered as a fusion between supervised and
unsupervised classification. We have shown that SOMwiseR
can learn to differentiate the functional clusters in
various synthetic and real cases of practical interest.
So far, we have used a number of analytic tools to assess
the SOMwiseR trained model. We have also revealed the
system’s sensitivity to noise. A software package has been
developed in the R environment and is available on
request.
There are a few more ideas that have not been analyzed
in this paper. We now summarize these ideas.
To complement the dataload and error matrices, we can
also rely on truly U-like matrices for cluster identification.
Like U-Matrix [19], we can meaningfully compute the
average quadratic deviation between any two units’ outputs
over the full data set, ignoring the actual responses. Since
there is some evidence that some type of functional inter-
polation can be achieved in our MLP context [14], the
resulting U-Matrix may produce another helpful view of
the target cluster structure. The P-Matrix [18], specifically
designed for dealing with toroidal maps, displays an esti-
mation of the data space density at each neuron, much like
kernel density estimation. To compute a P-Matrix variant
for SOMwiseR, the number of data items whose
prediction error is under some threshold for each neuron is
divided by the threshold itself. Therefore, this threshold
corresponds in the original P-Matrix to the radius of the
hypersphere whose volume is to be estimated. Note that
both the P-Matrix and the U-Matrix are specially suitable
for maps where the number of data items is related to the
number of neurons.
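The two matrix variants just described can be sketched as follows; the function names and the toy threshold are ours, and the formulas follow the textual description above.

```python
import numpy as np

def p_matrix_variant(pred_errors, threshold):
    """SOMwiseR-style P-Matrix: per unit, the number of data items whose
    prediction error falls below `threshold`, divided by the threshold
    (which plays the role of the hypersphere radius in the original
    P-Matrix [18]). `pred_errors` is (n_items, n_units)."""
    return (pred_errors < threshold).sum(axis=0) / threshold

def u_like_distance(outputs_a, outputs_b):
    """U-Matrix-style dissimilarity between two units: the average
    quadratic deviation of their outputs over the full data set,
    ignoring the actual responses."""
    return float(np.mean((outputs_a - outputs_b) ** 2))
```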
To promote the proper distribution of clusters over the
map, we can introduce more powerful MLPs with a larger
number of hidden neurons scattered in specific positions
across the map. These enhanced MLPs would have greater
chances of winning data and thus would typically help to
build the clusters.
The SOMwiseR map can also be seen as a mere data
preprocessor for simplifying the clustering task. In this
case, each zn would be replaced by its response map, with
the idea of displaying clusters more obviously in the
response map space. Kontkanen et al. [11] have proposed a
similar idea. They replace each xn vector with its predictive
distribution given a central Bayesian network model. They
then consider the xn vectors that lead to similar predictive
distributions as being similar. Finally, these similarities are
used to project data items onto a planar region for visual
inspection.
Finally, as already noted, some other supervised learn-
ing methods may also play the role of the MLP in the
SOMwiseR approach; see also [24]. A similar idea was
also introduced in [16].
Acknowledgments This research was partially supported by projects
TIN2007-62626 and Cajal Blue Brain. We are very grateful to
Prof. Concha Bielza and Prof. Pedro Larranaga for their valuable
support. Finally, we would like to express our very special gratitude
in memory of the first author, Prof. Jorge Muruzabal, who devised
the idea and provided the inspiration for this and many other papers.
It was a pleasure to work with him and to share his enthusiastic
attitude toward science.
References
1. Bauer H, Pawelzik K (1992) Quantifying the neighborhood
preservation of self-organizing feature maps. IEEE Trans Neural
Netw 4(3):570–579
2. Bishop CM (1995) Neural networks for pattern recognition.
Oxford University Press, Oxford
3. Brusco MJ, Cradit JD, Tashchian A (2003) Multicriterion clust-
erwise regression for joint segmentation: an application to cus-
tomer value. J Mark Res 40(2):225–234
4. Chtourou S, Chtourou M, Hammami O (2008) A hybrid approach
for training recurrent neural networks: application to multi-step-
ahead prediction of noisy and large data sets. Neural Comput
Appl 17(3):245–254
5. DeSarbo W, Cron W (1988) A maximum likelihood methodology
for clusterwise linear regression. J Classif 5:249–282
6. Hennig C (1999) Models and methods for clusterwise linear
regression. In: Gaul W, Locarek-Junge H (eds) Classification in
the information age. Springer, Berlin, pp 179–187
7. Herrmann L, Ultsch A (2007) Label propagation for semi-
supervised learning in self-organizing maps. In: 6th International
workshop on self-organizing maps, Bielefeld, Germany
8. Heskes T (1999) Energy functions for self-organizing maps. In:
Oja E, Kaski S (eds) Kohonen maps. Elsevier, Amsterdam,
pp 303–316
9. Kathirvalavakumar T, Jeyaseeli Subavathi S (2009) Neighbor-
hood based modified backpropagation algorithm using adaptive
learning parameters for training feedforward neural networks.
Neurocomputing 72(16–18):3915–3921
10. Kohonen T (2001) Self-organizing maps. Springer, Berlin
11. Kontkanen P, Lahtinen J, Myllymaki P, Silander T, Tirri H
(2000) Supervised model-based visualization of high-dimen-
sional data. Intell Data Analysis 4(3–4):213–227
12. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I,
Lozano J, Armananzas R, Santafe G, Perez A, Robles V (2006)
Machine learning in bioinformatics. Brief Bioinformat
7(1):86–112
13. McCormick R (1993) Managerial economics. Prentice-Hall,
Englewood Cliffs, NJ
14. Melssen W, Wehrens R, Buydens L (2006) Supervised Kohonen
networks for classification problems. Chemom Intell Lab Syst
83(2):99–113
15. Srivastava S, Zhang L, Jin R, Chan C (2008) A novel method
incorporating gene ontology information for unsupervised clus-
tering and feature selection. PLoS ONE 3(12):e3860
16. Tokunaga K, Furukawa T (2009) Modular network SOM. Neural
Netw 22(1):82–90
17. Tsimboukakis N, Tambouratzis G (2007) Self-organizing word
map for context-based document classification. In: 6th Interna-
tional workshop on self-organizing maps, Bielefeld, Germany
18. Ultsch A (2003) Maps for the visualization of high-dimensional
data spaces. In: Workshop on self-organizing maps, Kyushu,
Japan, pp 225–230
19. Ultsch A, Siemon H (1990) Kohonen’s self-organizing feature
maps for exploratory data analysis. In: Proceedings of the inter-
national neural networks conference, Kluwer Academic Press,
Paris, pp 305–308
20. Van Hulle MM (2000) Faithful representations and topographic
maps: from distortion- to information-based self-organization.
Wiley, New York
21. Vidaurre D, Muruzabal J (2007) A quick assessment of topology
preservation for SOM structures. IEEE Trans Neural Netw
18(5):1524–1528
22. Villmann T, Herrmann M, Martinetz T (1997) Topology preser-
vation in self-organizing feature maps: exact definition and
measurement. IEEE Trans Neural Netw 8(2):256–266
23. Villmann T, Seiffert U, Schleif F, Bruß C, Geweniger T, Hammer
B (2006) Fuzzy labeled self-organizing map with label-adjusted
prototypes, LNAI, vol 4087. Springer, Ulm, Germany, pp 46–56
24. Weiss GM (2004) Mining with rarity: a unifying framework.
SIGKDD Explor Newsl 6(1):7–19
25. Xiao Y, Clauset A, Harris R, Bayram E, Santago P, Schmitt J
(2005) Supervised self-organizing maps in drug discovery: 1.
Robust behavior with overdetermined data sets. J Chem Inf
Model 45(6):1749–1758