SMEM Algorithm for Mixture Regression Problems
Satoshi Suzuki and Naonori Ueda
NTT Communication Science Laboratories, Kyoto, 619-0237 Japan
SUMMARY
We investigate applications of the SMEM (Split and
Merge EM) algorithm to regression problems. We use the
NGnet (Normalized Gaussian network) as a mixture regres-
sion model in which a set of input and output variables is
dealt with as a joint probability. We describe the split and
merge operations and criteria necessary for applying the
SMEM algorithm to the NGnet. We show the usefulness of
the SMEM algorithm for regression problems through ex-
periments using artificial and real data. © 2001 Scripta Technica, Electron Comm Jpn Pt 2, 84(12): 54–62, 2001. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-II, No. 12, December 2000, pp. 2777–2785.
Key words: EM algorithm; regression; NGnet;
mixture model; maximum likelihood estimation; joint dis-
tribution.
1. Introduction
The SMEM (Split and Merge EM) algorithm, proposed by one of the authors, is a method based on the EM algorithm [1] for mixture models. The algorithm avoids local optima and leads to better results by splitting and merging mixture components. Its effectiveness for mixture density estimation and data compression has been shown in computational experiments [2]. Until now, however, its effectiveness for regression problems had not been clarified, because of the differences in characteristics between regression and distribution estimation. In this paper, we show the effectiveness of the SMEM algorithm for regression problems.
The MEnet (Mixture of Experts network) proposed by Jacobs and colleagues [3] is a popular mixture regression model, and a learning method based on the EM algorithm has been proposed for it [4]. However, we cannot combine this method with the SMEM algorithm, because the Q function used in the SMEM algorithm must be representable as a direct sum over the mixture components [2].
Xu and colleagues proposed and demonstrated a
learning network based on the MEnet in which they re-
placed the gating network with Gaussian functions for fast
convergence of the EM algorithm [5]. They also showed
that when a set of input and output variables is dealt with
as a joint probability, the Q function can be expanded such
that the algorithm can be represented as a direct sum over
its components, and therefore, the parameter values are
independently calculated in each component. That is, we
can apply the SMEM algorithm to this network, which we
call the NGnet (Normalized Gaussian network) [6]. Since
the SMEM algorithm repeatedly performs the EM algorithm in its process, we can expect much faster convergence toward the optimum from the combination of the SMEM algorithm and the NGnet.
As mentioned above, we verify the effectiveness of the SMEM algorithm for mixture regression problems with the NGnet. We also modify the split and merge operations, the core operations of the SMEM algorithm, to accommodate regression.
2. NGnet and SMEM Algorithm
The NGnet has a modular structure like the MEnet.
The transformation of input $x \in \mathbb{R}^{d_x}$ to output $y \in \mathbb{R}^{d_y}$ is given by

$$ y = \sum_{i=1}^{M} \frac{G_i(x \mid m_i, \Sigma_i)}{\sum_{j=1}^{M} G_j(x \mid m_j, \Sigma_j)} \, f_i(x; W_i), \qquad (1) $$
where $G_i(x \mid m_i, \Sigma_i)$ denotes a $d_x$-dimensional normal distribution:

$$ G_i(x \mid m_i, \Sigma_i) = (2\pi)^{-d_x/2} \, |\Sigma_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - m_i)^{\mathsf T} \Sigma_i^{-1}(x - m_i)\right\}. \qquad (2) $$

Here, $M$ is the number of components, and $m_i$, $\Sigma_i$, and $W_i$ are the mean vector, the covariance matrix, and the transformation matrix of the $i$-th component, respectively.
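In outline, Eqs. (1) and (2) can be computed directly; the following NumPy sketch is our own illustration (the function names and the augmented-input convention $\tilde{x} = (x^{\mathsf T}, 1)^{\mathsf T}$ are assumptions, not code from the paper):

```python
import numpy as np

def gaussian(x, m, Sigma):
    """d_x-dimensional normal density G_i(x | m_i, Sigma_i) of Eq. (2)."""
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def ngnet_predict(x, means, covs, weights):
    """Eq. (1): normalized-Gaussian-weighted sum of local linear models.

    weights[i] is the (d_x + 1) x d_y matrix W_i acting on (x^T, 1)^T."""
    g = np.array([gaussian(x, m, S) for m, S in zip(means, covs)])
    g = g / g.sum()                   # normalized Gaussian activations
    x1 = np.append(x, 1.0)            # augmented input
    return sum(gi * (Wi.T @ x1) for gi, Wi in zip(g, weights))
```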
The differences between the NGnet and the MEnet lie in the structure of the gating network and, more importantly, in the treatment of the input as a random variable. That is, in the NGnet model, we treat the data set $D = \{(x_n, y_n);\ n = 1, \ldots, N\}$ as samples from a joint probability distribution; in the MEnet model, on the other hand, only the output data $y$ are regarded as probabilistic.
As with general regression models, we regard the distribution of output $y$ as Gaussian with mean vector $f_i(x; W_i)$ and covariance matrix $S_i$. That is, the conditional distribution of output $y$ is given by

$$ p(y \mid x, i, W_i, S_i) = (2\pi)^{-d_y/2} \, |S_i|^{-1/2} \exp\left\{-\tfrac{1}{2}\big(y - f_i(x; W_i)\big)^{\mathsf T} S_i^{-1} \big(y - f_i(x; W_i)\big)\right\}. $$

Therefore, the parameters we should estimate are

$$ \Theta = \{m_i, \Sigma_i, W_i, S_i;\ i = 1, \ldots, M\}. $$

The log-likelihood of the complete data $(x_n, y_n)$ is given by

$$ \log p(x_n, y_n \mid \Theta) = \log \sum_{i=1}^{M} P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i), $$
where $P(i)$ denotes the prior probability of selecting the $i$-th component; its value is set to $1/M$ (a uniform distribution).
From the above, the Q function for the NGnet is given by

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{M} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i) \big], \qquad (3) $$

where $\Theta^{(t)}$ denotes the parameter values estimated after the $t$-th iteration of the algorithm. Here, the probability density of the input, $p(x \mid i, m_i, \Sigma_i)$, is set to $G_i(x \mid m_i, \Sigma_i)$. Since $p(y \mid x, i, W_i, S_i)$ and $p(x \mid i, m_i, \Sigma_i)$ are calculated independently in each component, we can transform Eq. (3) into

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{M} q_i(\theta_i \mid \Theta^{(t)}), \qquad q_i = \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i) \big]. \qquad (4) $$
Here, $\Theta^{(t)}$ can be regarded as constant, so $q_i$ depends only on the $i$-th component. We can therefore apply the SMEM algorithm to the NGnet. Fortunately, the NGnet has the additional merit that $Q(\Theta \mid \Theta^{(t)})$ is a second-order function of the parameters $\Theta$ under the assumption of a joint probability of the input and output.
On the other hand, because the input $x$ is not treated probabilistically in the MEnet, the Q function is given by

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{M} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i \mid x_n, \Theta) \, p(y_n \mid x_n, i, W_i, S_i) \big], \qquad (5) $$

where $P(i \mid x_n, \Theta) = s_i(x_n, m_i, \Sigma_i)$. Here, $s_i(x, m_i, \Sigma_i)$ denotes the $i$-th output of the gating network. This indicates that the parameters of each component cannot be estimated independently in the MEnet, because $P(i \mid x_n, \Theta)$ depends on all components. The MEnet therefore needs another iterative calculation to maximize the Q function value [4].
In the NGnet, the maximization of the Q function value is obtained analytically. Suppose that data set $D$ is regarded as a joint distribution and that the transformation $f_i(x; W_i)$ in each component is linear:

$$ f_i(x; W_i) = W_i^{\mathsf T} \tilde{x}, \qquad \tilde{x} = (x^{\mathsf T}, 1)^{\mathsf T}; $$

then the optimal parameter values for the maximization of the Q function value are given by

$$ m_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, x_n}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}, \qquad (6) $$

$$ \Sigma_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, (x_n - m_i)(x_n - m_i)^{\mathsf T}}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}, \qquad (7) $$

$$ W_i = \Big[ \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \tilde{x}_n \tilde{x}_n^{\mathsf T} \Big]^{-1} \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \tilde{x}_n y_n^{\mathsf T}, \qquad (8) $$

$$ S_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \big(y_n - W_i^{\mathsf T} \tilde{x}_n\big)\big(y_n - W_i^{\mathsf T} \tilde{x}_n\big)^{\mathsf T}}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}. \qquad (9) $$
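As a concrete illustration of one EM iteration built from Eqs. (6) to (9) with the uniform prior $P(i) = 1/M$, consider the following sketch; the variable names and the use of SciPy for the input density are our assumptions, not code from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, Y, m, S, W, V):
    """One EM iteration. X: (N, dx), Y: (N, dy); per component i,
    m[i]: (dx,) mean, S[i]: (dx, dx) input covariance Sigma_i,
    W[i]: (dx+1, dy) regression matrix, V[i]: (dy, dy) output covariance S_i."""
    N, M = len(X), len(m)
    dy = Y.shape[1]
    X1 = np.hstack([X, np.ones((N, 1))])          # augmented inputs (x^T, 1)^T
    R = np.empty((N, M))
    for i in range(M):
        gx = multivariate_normal.pdf(X, mean=m[i], cov=S[i])
        r = Y - X1 @ W[i]                         # y_n - f_i(x_n; W_i)
        quad = np.einsum('nd,de,ne->n', r, np.linalg.inv(V[i]), r)
        gy = ((2 * np.pi) ** (-dy / 2) * np.linalg.det(V[i]) ** -0.5
              * np.exp(-0.5 * quad))
        R[:, i] = gx * gy / M                     # joint density times P(i) = 1/M
    P = R / R.sum(axis=1, keepdims=True)          # E step: posteriors
    for i in range(M):                            # M step: Eqs. (6)-(9)
        h = P[:, i]
        hs = h.sum()
        m[i] = h @ X / hs                                          # Eq. (6)
        d = X - m[i]
        S[i] = (d * h[:, None]).T @ d / hs                         # Eq. (7)
        A = (X1 * h[:, None]).T @ X1
        W[i] = np.linalg.solve(A, (X1 * h[:, None]).T @ Y)         # Eq. (8)
        r = Y - X1 @ W[i]
        V[i] = (r * h[:, None]).T @ r / hs                         # Eq. (9)
    return m, S, W, V, P
```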
3. SMEM Algorithm for Regression
3.1. Overview
The SMEM algorithm is a method designed to over-
come the local optima problem of the EM algorithm. It
applies SM (split and merge) operations to network components that do not fit the data well, and thereby avoids local optima to obtain a better solution [2].
An overview of the SMEM algorithm is described
below.
Step 1. Perform the usual EM updates from some initial parameter values $\Theta$ until convergence; then set the estimated parameters and the corresponding Q function value to $\Theta^*$ and $Q^*$, respectively.
Step 2. Compute the SM criteria (see Section 3.2) based on $\Theta^*$ and set the order of priority of the SM candidates $\{i, j, k\}_c$, where $i$, $j$, and $k$ are the indices of the network components and $c$ denotes the order of priority.
Step 3. Perform SM operations on the highest-priority candidate $\{i, j, k\}_c$ as follows: set the initial parameter values of the newly created components, and then estimate the parameter values $\Theta^{**}$ and the corresponding Q function value $Q^{**}$ (see Section 3.2.4).
Step 4. If $Q^{**} > Q^*$, then set $Q^* \leftarrow Q^{**}$, $\Theta^* \leftarrow \Theta^{**}$ and go to Step 2. Otherwise, go to Step 3 and perform SM operations on the next candidate. If no candidate is left, go to Step 5.
Step 5. Halt with $\Theta^*$ as the final parameter values.
This algorithm performs the split operation and the merge operation at the same time so as to keep the total number of parameters fixed during learning; the likelihood value generally increases as the number of parameters increases, so a fixed parameter count keeps the comparison in Step 4 fair.
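Steps 1 to 5 can be summarized in the following Python skeleton; the decomposition into `em`, `sm_candidates`, and `split_merge_em` callables is our own illustration of the control flow, not code from the paper:

```python
def smem(theta0, data, em, sm_candidates, split_merge_em):
    theta, q = em(theta0, data)                  # Step 1: usual EM to convergence
    while True:
        accepted = False
        for i, j, k in sm_candidates(theta, data):           # Step 2: priorities
            theta2, q2 = split_merge_em(theta, (i, j, k), data)        # Step 3
            if q2 > q:                           # Step 4: accept the improvement
                theta, q = theta2, q2
                accepted = True
                break                            # back to Step 2
        if not accepted:                         # Step 5: no candidate improves Q
            return theta
```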
3.2. Adaptation for regression
We investigate what should be changed in the SMEM
algorithm to adapt to regression problems.
3.2.1. Merge criterion
If there are many observed data points for which two components have similar posterior probabilities, $P(i \mid x_n, y_n, \Theta^{(t)}) \approx P(j \mid x_n, y_n, \Theta^{(t)})$, we should merge these two components to obtain a better distribution estimate. The merge criterion $J_{\mathrm{merge}}(i, j)$ is an indicator that components $i$ and $j$ should be merged.
In regression problems, the observed data are a com-
bination of the input and output, while in distribution esti-
mation problems, the observed data are only the input. We
therefore replace the posterior probability $P(i \mid x, \Theta)$ used for density estimation [2] with $P(i \mid x, y, \Theta)$ for regression. That is, the merge criterion for regression is given by

$$ J_{\mathrm{merge}}(i, j; \Theta^*) = P_i(\Theta^*)^{\mathsf T} P_j(\Theta^*), \qquad (10) $$

where the vector $P_i(\Theta^*) \in \mathbb{R}^N$ denotes

$$ P_i(\Theta^*) = \big( P(i \mid x_1, y_1, \Theta^*), \ldots, P(i \mid x_N, y_N, \Theta^*) \big)^{\mathsf T}. \qquad (11) $$

Note that the larger $J_{\mathrm{merge}}$ is, the higher the priority of the pair of components $i$ and $j$.
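In matrix form, Eq. (10) is an inner product of posterior columns, so all pairs can be scored at once; a brief sketch (the array name `P` is an assumption):

```python
import numpy as np

def merge_candidates(P):
    """P: (N, M) posteriors P(i | x_n, y_n, Theta*). Returns all pairs
    (i, j), i < j, sorted by J_merge of Eq. (10) in decreasing order."""
    M = P.shape[1]
    J = P.T @ P                                  # J[i, j] = P_i^T P_j
    pairs = [(i, j) for i in range(M) for j in range(i + 1, M)]
    return sorted(pairs, key=lambda p: -J[p])
```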
3.2.2. Split criterion
The split criterion Jsplit�k� is utilized to select a com-
ponent that does not have a well-estimated distribution. The
selected component is split into two components to get a
better estimation.
A modified Kullback–Leibler (KL) divergence is used as the split criterion in mixture distribution estimation [2]. For the same reason as in the previous section, we replace the estimated distribution $p(x \mid k, \Theta)$ and the empirical distribution $f_k(x)$ used for distribution estimation with $p(x, y \mid k, \Theta)$ and $f_k(x, y)$ for regression, respectively. That is, the split criterion is

$$ J_{\mathrm{split}}(k; \Theta^*) = \iint f_k(x, y) \log \frac{f_k(x, y)}{p(x, y \mid k, \Theta^*)} \, dx \, dy, \qquad (12) $$

where the empirical distribution $f_k(x, y)$ is given by

$$ f_k(x, y) = \frac{\sum_{n=1}^{N} \delta(x - x_n) \, \delta(y - y_n) \, P(k \mid x_n, y_n, \Theta^*)}{\sum_{n=1}^{N} P(k \mid x_n, y_n, \Theta^*)}. \qquad (13) $$

Note that the larger $J_{\mathrm{split}}(k)$ is, the higher the priority of component $k$.
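On a finite sample, the integral in Eq. (12) is naturally approximated by a sum over the data points, with $f_k$ of Eq. (13) carried by the normalized posteriors; in the sketch below, `P` and `pxy`, the model densities $p(x_n, y_n \mid k, \Theta^*)$, are assumed to be precomputed:

```python
import numpy as np

def split_scores(P, pxy, eps=1e-300):
    """P, pxy: (N, M) arrays. Returns J_split(k) of Eq. (12) for each k,
    approximated on the sample; eps guards the logarithm."""
    f = P / P.sum(axis=0, keepdims=True)   # f_k at the data points, Eq. (13)
    return (f * np.log((f + eps) / (pxy + eps))).sum(axis=0)
```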
3.2.3. Priority of split and merge candidates
The order of priority of the SM (split and merge)
operations strongly influences the calculation time in the
SMEM algorithm. Therefore, how to set the order of prior-
ity is very important. As described above, the split criterion
and the merge criterion are independently calculated, but
how can we set the order of the split and merge candidates
as a combination? Simple alternatives are the merge-first
method and split-first method.
The merge-first method is described below. First, select the merge candidate $\{i, j\}_c$ having the highest priority by $J_{\mathrm{merge}}(i, j)$. Next, select the split candidate $\{k\}_c$ having the highest priority among the remaining components by $J_{\mathrm{split}}(k)$, $k \neq i, j$. This process gives the order $c = 1, \ldots, M - 2$ to the SM candidates $\{i, j, k\}_c$. When more candidates are necessary, repeat the above.
The split-first method is described as follows. First, select the split candidate $\{k\}_c$ having the highest priority by $J_{\mathrm{split}}(k)$. Next, select the merge candidates $\{i, j\}_c$, $i, j \neq k$, from among the remaining components. This process gives the order $c = 1, \ldots, (M - 1)(M - 2)/2$ to the SM candidates $\{i, j, k\}_c$. When more candidates are necessary, repeat the above.
We choose the latter method because of the results of
the computational experiments described in Section 4.
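The split-first ordering can be written down directly; in the sketch below, `J_split` is a length-M array and `J_merge` an M × M array of the criteria of Sections 3.2.1 and 3.2.2 (names assumed, not from the paper):

```python
def split_first_candidates(J_split, J_merge):
    """Enumerate SM candidates {i, j, k}_c in split-first priority order."""
    M = len(J_split)
    order = []
    for k in sorted(range(M), key=lambda k: -J_split[k]):
        pairs = [(i, j) for i in range(M) for j in range(i + 1, M)
                 if k not in (i, j)]
        pairs.sort(key=lambda p: -J_merge[p[0]][p[1]])
        order.extend((i, j, k) for i, j in pairs)
    return order   # the first (M - 1)(M - 2)/2 entries share the top split k
```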
3.2.4. Learning in split and merge operations
In the SM operations, we create a new component $i'$ by merging components $i$ and $j$, and also create new components $j'$ and $k'$ by splitting component $k$. The initial parameter values of component $i'$ are set by replacing the posterior probability $P(i \mid x, y, \Theta)$ in Eqs. (6) to (9) with

$$ P(i' \mid x_n, y_n, \Theta^*) = P(i \mid x_n, y_n, \Theta^*) + P(j \mid x_n, y_n, \Theta^*). \qquad (14) $$
For components $j'$ and $k'$, the initial parameter values are set by adding small perturbations $\epsilon$ to the parameters of component $k$:

$$ m_m = m_k + \epsilon_m, \qquad \Sigma_m = \Sigma_k, \qquad (15) $$

$$ W_m = W_k + \epsilon'_m, \qquad S_m = S_k, \qquad (16) $$

where $m = j', k'$.
Reestimation of the parameters in the SM operations consists of two steps: a partial-EM step followed by a full-EM step. In the partial-EM step, we estimate the parameters of the three newly created components $i'$, $j'$, $k'$ without influencing the other components. Concretely, these parameters are renewed by Eqs. (6) to (9) with the posterior probability replaced by

$$ P(m \mid x_n, y_n, \Theta) = \frac{P(m \mid x_n, y_n, \Theta)}{\sum_{m' = i', j', k'} P(m' \mid x_n, y_n, \Theta)} \sum_{m' = i', j', k'} P(m' \mid x_n, y_n, \Theta^*), \qquad (17) $$

where $m = i', j', k'$. After the learning converges in the partial-EM step, the parameters of all components are reestimated by the usual EM algorithm; this is the full-EM step.
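The reweighting in Eq. (17) keeps the total responsibility of the three new components equal to the responsibility that $i$, $j$, and $k$ held under $\Theta^*$; a sketch with assumed array names:

```python
import numpy as np

def partial_em_posterior(P_cur, P_star, new_idx, old_idx):
    """P_cur: (N, M') posteriors under the current parameters;
    P_star: (N, M) posteriors under Theta*; new_idx = [i', j', k'],
    old_idx = [i, j, k]. Returns the renormalized block of Eq. (17)."""
    total = P_star[:, old_idx].sum(axis=1, keepdims=True)  # mass held by i, j, k
    block = P_cur[:, new_idx]
    return block / block.sum(axis=1, keepdims=True) * total
```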
4. Computational Experiments
We evaluate the effectiveness of the SMEM algorithm by both 3D surface estimation using artificial data and time series prediction using high-dimensional real data.
4.1. 3D surface estimation
We first show experiments of 3D surface estimation
using 2D input and 1D output data. Several examples of the
learning process and results using a five-component net-
work are shown in Fig. 1. The target function which trans-
forms 2D input into 1D output is plotted in Fig. 1(a).
For training, we prepared 1000 data points randomly selected in the input space and added 5% Gaussian noise to their output values. We evaluated the regression results at 21 × 21 points selected at equal intervals within the input space.
The regression results based on some initial parame-
ter values are shown in Fig. 1(b). (b-1) is the estimated
surface. In (b-2), the ovals show the input distributions
estimated by each component and the target function is
illustrated as the contour lines in the background.
Figure 1(c) shows a result obtained by the usual EM
algorithm. In this simulation, the learning converged after
172 iterations of the EM procedure. The estimated surface
(c-1) looks closer to the target than the reconstruction by
the initial parameter values (b-1). However, the input distributions estimated by each component do not fit well: the components would need nonlinear transformations to reconstruct the target surface, but each component is capable of only a linear transformation (c-2).
Intermediate results of the learning process are illustrated in Figs. 1(d) and 1(e). These figures show the results after 202 and 387 EM procedures, respectively, where the counts include the EM procedures rejected in Step 4 of the SMEM
algorithm. It is seen that the estimated surface gets closer
to the target as the learning proceeds.
Figure 1(f) shows the final result obtained by the
SMEM algorithm. The total number of EM procedures in
this learning is 580. (f-1) shows a better estimated surface
than (c), (d), or (e). The input distributions estimated by
each component are reasonable because the transformation
that the components should obtain is almost linear (f-2).
Table 1 compares the EM and SMEM algorithms. In the table, the statistics of 100 simula-
tions using random initial values are compared for the two
algorithms. The average (ave.), standard deviation (std.),
maximum (max.), and minimum (min.) values of the log-
likelihood, and calculation volume (i.e., total number of
EM procedures) are shown for each algorithm. The SMEM algorithm attains a better log-likelihood than the EM algorithm, but costs about 13 times more computation.
These results indicate that one simulation of the SMEM algorithm is roughly equal to 13 simulations of the EM algorithm in terms of calculation volume. It is therefore reasonable to compare the two algorithms at the same calculation volume, that is, to compare a single simulation by the SMEM algorithm with the best result of 13
simulations by the EM algorithm. Table 2 shows averages
and standard deviations of 30 simulations by the SMEM
algorithm and 30 sets of simulations in which each set is
the best result of 13 simulations by the EM algorithm. We
can confirm the advantage of the SMEM algorithm even for
the same calculation volume from these results.
The order of priority of SM candidates
We described two methods for determining the order
of priority of SM candidates in Section 3.2.3: the split-first
Fig. 1. An example of learning processes. (a) Target function; (b) estimation result from initial values; (c) result by the EM
algorithm; (d, e) results in processes of the SMEM algorithm; (f) final result by the SMEM algorithm.
Table 1. Comparison between two algorithms
method and the merge-first method. The following de-
scribes experiments investigating which of the two methods
is more efficient.
As previously mentioned, the SMEM algorithm rejects the estimate from an SM candidate if $Q^{**} \leq Q^*$. We therefore collected statistics on the earliest priority order at which an SM candidate was accepted in each SM operation. Figure 2 shows the results of 100 simulations. The acceptance ratio is plotted against the order of priority for (a) the split-first method and (b) the merge-first method. The earlier in the order the accepted candidate appears, the fewer SM operations the SMEM algorithm performs, and hence the more efficient it is.
In the split-first method, an earlier order has a higher acceptance ratio (a). On the other hand, the relationship between the order of priority and the acceptance ratio is not clear in the merge-first method, which means that more SM operations are performed uselessly (b). These results are also evident in (c), which shows the cumulative number of acceptances for each method. The split-first method has larger values over all orders than the merge-first method; that is, the split-first method is more efficient.
4.2. Time series prediction
We discuss time series prediction using high-dimen-
sional real data in this section. We use one-dimensional time series data observed from a far-infrared laser [7].* We consider any 25 successive data points as a high-dimensional input vector and estimate the data point that follows them.

The 25-dimensional vector regarded as the input, and the corresponding output, are given by

$$ x_n = (s_n, s_{n+1}, \ldots, s_{n+24})^{\mathsf T}, \qquad y_n = s_{n+25}, \qquad (18) $$

where $s_t$ denotes the observed series.
Here, the covariance matrix $\Sigma_i$ calculated according to Eq. (7) is not always a regular (nonsingular) matrix, because of both the sparseness of the input distribution and the limited numerical precision of the high-dimensional calculation. In the experiments, therefore, we used covariance matrices regularized with very small values $\epsilon$ and $\epsilon'$.
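The windowing of Eq. (18) can be sketched as follows; `series` stands for the one-dimensional laser signal (an assumed name):

```python
import numpy as np

def make_windows(series, d=25):
    """Each input is d successive samples; the output is the next sample."""
    X = np.array([series[n:n + d] for n in range(len(series) - d)])
    y = np.asarray(series[d:])
    return X, y      # e.g., 1000 raw samples yield 975 (input, output) pairs
```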
The observed one-dimensional data are plotted in Fig. 3(a). In the experiments, we regarded each set of 25 successive data points as the input and the following data point as the output.
Table 2. Comparison with the same calculation volume
Fig. 2. Comparison between two types of computations
determining the order of priority of SM candidates. (a)
Acceptance ratios of each order by the split-first
method; (b) acceptance ratios of each order by
the merge-first method; (c) comparison of the
number of accumulations between the two
methods.
*These data were used in the Santa Fe Institute Time Series Prediction and Analysis Competition.
For training, 975 data sets were prepared from
the data. We performed two experiments: estimation of the next data point from each input, and iterative estimation starting from an initial input vector. Figure 3(b) illustrates an example of the results using a 50-component network. The solid line shows the target data, the thick dashed line is the one-step estimate of the next data point from each input, and the thin dashed line is the result estimated iteratively from the first 25 data points.
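The two prediction modes can be sketched as follows; `predict` stands for the trained network's input-output map and is an assumed callable:

```python
import numpy as np

def one_step_predictions(predict, series, d=25):
    """Thick dashed line: predict the next point from each true window."""
    return np.array([predict(np.asarray(series[n:n + d]))
                     for n in range(len(series) - d)])

def iterative_predictions(predict, seed, steps, d=25):
    """Thin dashed line: feed each estimate back into the input window."""
    buf = list(seed[:d])             # start from the first d true points
    for _ in range(steps):
        buf.append(predict(np.asarray(buf[-d:])))
    return np.array(buf[d:])
```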
For the estimation of the following data point, Table 3
compares the statistics of the results of 10 simulations by the
SMEM algorithm and the EM algorithm in both 10- and
50-component networks. The statistics of log-likelihood val-
ues are also shown. All of these results indicate that the SMEM
algorithm has a clear advantage over the EM algorithm.
Comparisons with other algorithms are shown in
Table 4. These values are statistics of squared errors of
estimations. Here, MLP denotes a three-layer network with ten hidden units trained by the minimum-squared-error method, and MDL denotes the regularization method proposed by Saito and Nakano [8]. No significant
differences can be seen between the SMEM algorithm
using a 50-component network and MDL.
As described above, the SMEM algorithm estimates the next data point well from each input. On the other hand, the results of iterative estimation show large errors: the average and standard deviation of the squared errors are 0.20 and 0.234e-4, respectively, for a 50-component network. These results indicate that the NGnet model is unable to estimate such unsteady dynamic data.
5. Discussion
5.1. Distribution estimation and regression
with joint probability
We can regard the complete data $z_n = (x_n, y_n)$ as samples from a distribution and perform distribution estimation.
Fig. 3. An experiment using real data. (a) Training data; (b) test data (solid line) and estimation results by the SMEM
algorithm (dashed lines).
Table 3. Results based on real data
Table 4. Comparison with other methods
What, then, is the difference between distribution estimation and regression under the assumption of joint probability? We discuss this question below.
As far as mixture distribution estimation is concerned, the likelihood of the complete data $z_n = (x_n, y_n)$ is written

$$ p(z_n \mid \Theta) = \sum_{i=1}^{M} P(i) \, G_i(z_n \mid m_i, \Sigma_i). \qquad (19) $$

The unknown parameters in this equation are the mean vector and covariance matrix of the complete data in each component: $m_i \in \mathbb{R}^{d_x + d_y}$ and $\Sigma_i \in \mathbb{R}^{(d_x + d_y) \times (d_x + d_y)}$, respectively. On the other hand, the likelihood function of the complete data in regression estimation is, as described in Section 2, given by

$$ p(x_n, y_n \mid \Theta) = \sum_{i=1}^{M} P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i). \qquad (20) $$
The differences between the two equations are the first factor on the right-hand side of Eq. (20), namely $p(y_n \mid x_n, i, W_i, S_i)$, and, more importantly, the newly added unknown parameters $W_i$ and $S_i$, which denote the transformation from the input to the output and the covariance matrix of the output in each component, respectively. That is, the difference between the two types of estimation lies in which parameters are to be estimated.
In regression using the NGnet in particular, the unknown parameters are $m_i \in \mathbb{R}^{d_x}$, $\Sigma_i \in \mathbb{R}^{d_x \times d_x}$, $W_i \in \mathbb{R}^{(d_x + 1) \times d_y}$, and $S_i \in \mathbb{R}^{d_y \times d_y}$. That is, the number of unknown parameters is smaller than in the distribution estimation described by Eq. (19).
5.2. Merit of joint probability for regression
The number of unknown parameters is generally
larger in a model of joint probability than in a model
regarding only the output as a probability. This is because
the input distribution also needs to be estimated. The esti-
mation accuracy by the model of joint probability, there-
fore, is worse when the number of data sets is not much
larger than the number of unknown parameters.
With the NGnet, however, the number of unknown parameters is the same for the joint probability model and the output probability model. This is because even the output probability model must estimate the mean $m_i$ and covariance matrix $\Sigma_i$ of the input, which are the parameters of the Gaussian functions in the gating network. The joint probability model is therefore considered more effective in the NGnet than the output probability model, because its optimization can be carried out analytically.
To clarify this difference in effectiveness between the two models, we compared calculation times and squared errors for regression using a three-component network. The target function was set to $y = 10 \tanh(x)$, $-10 < x < 10$.
Table 5 shows statistics of ten simulations using
random initial parameter values. The average calculation
time for the model of joint probability is about 1/25 that for
the model of output probability. The average squared error
is also smaller in the former model. These results show the
effect of the analytical scheme for optimization in the
former model. The standard deviation is also smaller in the
former model, which means that the former model is less
influenced by the initial parameter values. Taken together,
these results indicate that the joint probability model gives
us fast, accurate, and stable learning in the NGnet.
6. Conclusion
We showed the effectiveness of the SMEM algorithm
for mixture regression problems using the NGnet. We also
clarified what should be changed in the split and merge operations and how to order the split and merge candidates. The results of computational experiments using
artificial and real data indicated that the estimation accuracy
by the SMEM algorithm is better than that by the EM
algorithm for regression problems, and also that the as-
sumption of joint probability provides fast, accurate, and
stable learning.
REFERENCES
1. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 1977;39:1–38.
2. Ueda N, Nakano R, Ghahramani Z, Hinton GE. SMEM algorithm for mixture models. Neural Comput 2000;12:2109–2128.
Table 5. Effects of assumption of joint distribution
3. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Comput 1991;3:79–87.
4. Jordan MI, Jacobs RA. Hierarchical mixtures of experts and the EM algorithm. Neural Comput 1994;6:181–214.
5. Xu L, Jordan MI, Hinton GE. An alternative model for mixtures of experts. In: Advances in Neural Information Processing Systems 7. MIT Press; 1995. p 633–640.
6. Ishii S, Sato M. Normalized Gaussian network: mixture of experts and EM algorithm. JNNS 1999;6:30–40. (in Japanese)
7. Weigend AS, Gershenfeld NA. Time series prediction: Forecasting the future and understanding the past. Addison-Wesley; 1993.
8. Saito K, Nakano R. A new regularization based on the MDL principle. JSAI 1998;13:123–130.
AUTHORS (from left to right)
Satoshi Suzuki (member) received his B.S. degree in basic sciences from the University of Tokyo in 1990 and then joined
Nippon Telegraph and Telephone Corporation (NTT) Laboratories. Currently, he is a research scientist at NTT Communication
Science Laboratories. From 1992 to 1997, he was a research scientist at ATR Human Information Processing Research
Laboratories. He received the Japanese Neural Network Society (JNNS) Research Award in 1994. He is a member of JNNS and
IEICE.
Naonori Ueda (member) received his B.S., M.S., and Ph.D. degrees in communication engineering from Osaka University
in 1982, 1984, and 1992. In 1984, he joined the Electrical Communication Laboratories, Nippon Telegraph and Telephone
Corporation (NTT). Currently, he is a senior research scientist (distinguished technical member) at NTT Communication Science
Laboratories, and a guest associate professor at the Nara Institute of Science and Technology. He was a visiting scholar at Purdue University in 1994–1995. He received the Japanese Neural Network Society (JNNS) Research Award, the 12th
Telecommunication Advancement Foundation Award (Telecom System Technology prize), and a paper award from IEICE in
1995, 1997, and 2000, respectively. He is a member of JNNS, the Japan Statistical Society, and IEEE.