SMEM Algorithm for Mixture Regression Problems
Satoshi Suzuki and Naonori Ueda
NTT Communication Science Laboratories, Kyoto, 619-0237 Japan
SUMMARY
We investigate applications of the SMEM (Split and
Merge EM) algorithm to regression problems. We use the
NGnet (Normalized Gaussian network) as a mixture regres-
sion model in which a set of input and output variables is
dealt with as a joint probability. We describe the split and
merge operations and criteria necessary for applying the
SMEM algorithm to the NGnet. We show the usefulness of
the SMEM algorithm for regression problems through ex-
periments using artificial and real data. © 2001 Scripta Technica, Electron Comm Jpn Pt 2, 84(12): 54–62, 2001. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-II, No. 12, December 2000, pp. 2777–2785.
Key words: EM algorithm; regression; NGnet;
mixture model; maximum likelihood estimation; joint dis-
tribution.
1. Introduction
The SMEM (Split and Merge EM) algorithm, proposed by one of the authors, is a method based on the EM algorithm [1] for mixture models. The algorithm avoids local optima and leads to better results by splitting and merging mixture components. Its effectiveness for mixture density estimation and data compression has been shown in computational experiments [2]. Until now, however, its effectiveness for regression problems had not been clarified, because of the differences in characteristics between regression and distribution estimation. In this paper, we show the effectiveness of the SMEM algorithm for regression problems.
The MEnet (Mixture of Experts network) proposed by Jacobs and colleagues [3] is a popular mixture regression model, and a learning method based on the EM algorithm has been proposed for it [4]. However, we cannot combine this method with the SMEM algorithm, because the Q function used in the SMEM algorithm must be representable as a direct sum over the mixture components [2].
Xu and colleagues proposed and demonstrated a
learning network based on the MEnet in which they re-
placed the gating network with Gaussian functions for fast
convergence of the EM algorithm [5]. They also showed
that when a set of input and output variables is dealt with
as a joint probability, the Q function can be expanded such
that the algorithm can be represented as a direct sum over
its components, and therefore, the parameter values are
independently calculated in each component. That is, we
can apply the SMEM algorithm to this network, which we
call the NGnet (Normalized Gaussian network) [6]. Since
the SMEM algorithm repeatedly performs the EM algorithm in its process, we can expect much faster convergence toward the optimum from the combination of the SMEM algorithm and the NGnet.
As mentioned above, we verify the effectiveness of the SMEM algorithm for mixture regression problems with the NGnet. We also modify the split and merge operations, the core operations of the SMEM algorithm, to accommodate regression.
2. NGnet and SMEM Algorithm
The NGnet has a modular structure like the MEnet.
The transformation of input $x \in \mathbb{R}^{d_x}$ to output $y \in \mathbb{R}^{d_y}$ is given by

$$ y = \sum_{i=1}^{M} \frac{G_i(x \mid m_i, \Sigma_i)}{\sum_{j=1}^{M} G_j(x \mid m_j, \Sigma_j)} \, f_i(x; W_i), \qquad (1) $$
where $G_i(x \mid m_i, \Sigma_i)$ denotes a $d_x$-dimensional normal distribution:

$$ G_i(x \mid m_i, \Sigma_i) = (2\pi)^{-d_x/2} \, |\Sigma_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - m_i)^{\mathsf T} \Sigma_i^{-1}(x - m_i)\right\}. \qquad (2) $$

Here, $M$ is the number of components, and $m_i$, $\Sigma_i$, and $W_i$ are the mean vector, the covariance matrix, and the transformation matrix of the $i$-th component, respectively.
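In outline, Eqs. (1) and (2) can be computed directly; the following NumPy sketch is our own illustration (the function names and the augmented-input convention $\tilde{x} = (x^{\mathsf T}, 1)^{\mathsf T}$ are assumptions, not code from the paper):

```python
import numpy as np

def gaussian(x, m, Sigma):
    """d_x-dimensional normal density G_i(x | m_i, Sigma_i) of Eq. (2)."""
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def ngnet_predict(x, means, covs, weights):
    """Eq. (1): normalized-Gaussian-weighted sum of local linear models.

    weights[i] is the (d_x + 1) x d_y matrix W_i acting on (x^T, 1)^T."""
    g = np.array([gaussian(x, m, S) for m, S in zip(means, covs)])
    g = g / g.sum()                   # normalized Gaussian activations
    x1 = np.append(x, 1.0)            # augmented input
    return sum(gi * (Wi.T @ x1) for gi, Wi in zip(g, weights))
```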
The differences between the NGnet and the MEnet lie in the structure of the gating network and, more importantly, in the treatment of the input as a random variable. That is, in the NGnet model, we treat the data set $D = \{(x_n, y_n);\ n = 1, \ldots, N\}$ as samples from a joint probability distribution; in the MEnet model, on the other hand, only the output data $y$ are regarded as probabilistic.
As with general regression models, we regard the distribution of output $y$ as Gaussian with mean vector $f_i(x; W_i)$ and covariance matrix $S_i$. That is, the conditional distribution of output $y$ is given by

$$ p(y \mid x, i, W_i, S_i) = (2\pi)^{-d_y/2} \, |S_i|^{-1/2} \exp\left\{-\tfrac{1}{2}\big(y - f_i(x; W_i)\big)^{\mathsf T} S_i^{-1} \big(y - f_i(x; W_i)\big)\right\}. $$

Therefore, the parameters we should estimate are

$$ \Theta = \{m_i, \Sigma_i, W_i, S_i;\ i = 1, \ldots, M\}. $$

The log-likelihood of the complete data $(x_n, y_n)$ is given by

$$ \log p(x_n, y_n \mid \Theta) = \log \sum_{i=1}^{M} P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i), $$
where $P(i)$ denotes the prior probability of selecting the $i$-th component; its value is set to $1/M$ (a uniform distribution).
From the above, the Q function for the NGnet is given by

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{M} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i) \big], \qquad (3) $$

where $\Theta^{(t)}$ denotes the parameter values estimated after the $t$-th iteration of the algorithm. Here, the probability density of the input, $p(x \mid i, m_i, \Sigma_i)$, is set to $G_i(x \mid m_i, \Sigma_i)$. Since $p(y \mid x, i, W_i, S_i)$ and $p(x \mid i, m_i, \Sigma_i)$ are calculated independently in each component, we can transform Eq. (3) into

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{M} q_i(\theta_i \mid \Theta^{(t)}), \qquad q_i = \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i) \big]. \qquad (4) $$
Here, $\Theta^{(t)}$ can be regarded as constant, so $q_i$ depends only on the $i$-th component. We can therefore apply the SMEM algorithm to the NGnet. Fortunately, the NGnet has the additional merit that $Q(\Theta \mid \Theta^{(t)})$ is a second-order function of the parameters $\Theta$ under the assumption of a joint probability of the input and output.
On the other hand, because the input $x$ is not treated probabilistically in the MEnet, the Q function is given by

$$ Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{M} P(i \mid x_n, y_n, \Theta^{(t)}) \log \big[ P(i \mid x_n, \Theta) \, p(y_n \mid x_n, i, W_i, S_i) \big], \qquad (5) $$

where $P(i \mid x_n, \Theta) = s_i(x_n, m_i, \Sigma_i)$. Here, $s_i(x, m_i, \Sigma_i)$ denotes the $i$-th output of the gating network. This indicates that the parameters of each component cannot be estimated independently in the MEnet, because $P(i \mid x_n, \Theta)$ depends on all components. The MEnet therefore needs another iterative calculation to maximize the Q function value [4].
In the NGnet, the maximization of the Q function value is obtained analytically. Suppose that data set $D$ is regarded as a joint distribution and that the transformation $f_i(x; W_i)$ in each component is linear:

$$ f_i(x; W_i) = W_i^{\mathsf T} \tilde{x}, \qquad \tilde{x} = (x^{\mathsf T}, 1)^{\mathsf T}; $$

then the optimal parameter values for the maximization of the Q function value are given by

$$ m_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, x_n}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}, \qquad (6) $$

$$ \Sigma_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, (x_n - m_i)(x_n - m_i)^{\mathsf T}}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}, \qquad (7) $$

$$ W_i = \Big[ \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \tilde{x}_n \tilde{x}_n^{\mathsf T} \Big]^{-1} \sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \tilde{x}_n y_n^{\mathsf T}, \qquad (8) $$

$$ S_i = \frac{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)}) \, \big(y_n - W_i^{\mathsf T} \tilde{x}_n\big)\big(y_n - W_i^{\mathsf T} \tilde{x}_n\big)^{\mathsf T}}{\sum_{n=1}^{N} P(i \mid x_n, y_n, \Theta^{(t)})}. \qquad (9) $$
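As a concrete illustration of one EM iteration built from Eqs. (6) to (9) with the uniform prior $P(i) = 1/M$, consider the following sketch; the variable names and the use of SciPy for the input density are our assumptions, not code from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, Y, m, S, W, V):
    """One EM iteration. X: (N, dx), Y: (N, dy); per component i,
    m[i]: (dx,) mean, S[i]: (dx, dx) input covariance Sigma_i,
    W[i]: (dx+1, dy) regression matrix, V[i]: (dy, dy) output covariance S_i."""
    N, M = len(X), len(m)
    dy = Y.shape[1]
    X1 = np.hstack([X, np.ones((N, 1))])          # augmented inputs (x^T, 1)^T
    R = np.empty((N, M))
    for i in range(M):
        gx = multivariate_normal.pdf(X, mean=m[i], cov=S[i])
        r = Y - X1 @ W[i]                         # y_n - f_i(x_n; W_i)
        quad = np.einsum('nd,de,ne->n', r, np.linalg.inv(V[i]), r)
        gy = ((2 * np.pi) ** (-dy / 2) * np.linalg.det(V[i]) ** -0.5
              * np.exp(-0.5 * quad))
        R[:, i] = gx * gy / M                     # joint density times P(i) = 1/M
    P = R / R.sum(axis=1, keepdims=True)          # E step: posteriors
    for i in range(M):                            # M step: Eqs. (6)-(9)
        h = P[:, i]
        hs = h.sum()
        m[i] = h @ X / hs                                          # Eq. (6)
        d = X - m[i]
        S[i] = (d * h[:, None]).T @ d / hs                         # Eq. (7)
        A = (X1 * h[:, None]).T @ X1
        W[i] = np.linalg.solve(A, (X1 * h[:, None]).T @ Y)         # Eq. (8)
        r = Y - X1 @ W[i]
        V[i] = (r * h[:, None]).T @ r / hs                         # Eq. (9)
    return m, S, W, V, P
```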
3. SMEM Algorithm for Regression
3.1. Overview
The SMEM algorithm is a method designed to over-
come the local optima problem of the EM algorithm. It
applies SM (split and merge) operations to network components that do not fit the data well, and thereby avoids local optima to obtain a better solution [2].
An overview of the SMEM algorithm is described
below.
Step 1. Perform the usual EM updates from some initial parameter values $\Theta$ until convergence; then set the estimated parameters and the corresponding Q function value to $\Theta^*$ and $Q^*$, respectively.
Step 2. Compute the SM criteria (see Section 3.2) based on $\Theta^*$ and set the order of priority of the SM candidates $\{i, j, k\}_c$, where $i$, $j$, and $k$ are the indices of the network components and $c$ denotes the order of priority.
Step 3. Perform SM operations on the highest-priority candidate $\{i, j, k\}_c$ as follows: set the initial parameter values of the newly created components, and then estimate the parameter values $\Theta^{**}$ and the corresponding Q function value $Q^{**}$ (see Section 3.2.4).
Step 4. If $Q^{**} > Q^*$, then set $Q^* \leftarrow Q^{**}$, $\Theta^* \leftarrow \Theta^{**}$ and go to Step 2. Otherwise, go to Step 3 and perform SM operations on the next candidate. If no candidate is left, go to Step 5.
Step 5. Halt with $\Theta^*$ as the final parameter values.
This algorithm performs the split operation and the merge operation at the same time so as to keep the total number of parameters fixed during learning; the likelihood value generally increases as the number of parameters increases, so a fixed parameter count keeps the comparison in Step 4 fair.
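Steps 1 to 5 can be summarized in the following Python skeleton; the decomposition into `em`, `sm_candidates`, and `split_merge_em` callables is our own illustration of the control flow, not code from the paper:

```python
def smem(theta0, data, em, sm_candidates, split_merge_em):
    theta, q = em(theta0, data)                  # Step 1: usual EM to convergence
    while True:
        accepted = False
        for i, j, k in sm_candidates(theta, data):           # Step 2: priorities
            theta2, q2 = split_merge_em(theta, (i, j, k), data)        # Step 3
            if q2 > q:                           # Step 4: accept the improvement
                theta, q = theta2, q2
                accepted = True
                break                            # back to Step 2
        if not accepted:                         # Step 5: no candidate improves Q
            return theta
```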
3.2. Adaptation for regression
We investigate what should be changed in the SMEM
algorithm to adapt to regression problems.
3.2.1. Merge criterion
If there are many observed data points for which two components have similar posterior probabilities, $P(i \mid x_n, y_n, \Theta^{(t)}) \approx P(j \mid x_n, y_n, \Theta^{(t)})$, we should merge these two components to obtain a better distribution estimate. The merge criterion $J_{\mathrm{merge}}(i, j)$ is an indicator that components $i$ and $j$ should be merged.
In regression problems, the observed data are a com-
bination of the input and output, while in distribution esti-
mation problems, the observed data are only the input. We
therefore replace the posterior probability $P(i \mid x, \Theta)$ used for density estimation [2] with $P(i \mid x, y, \Theta)$ for regression. That is, the merge criterion for regression is given by

$$ J_{\mathrm{merge}}(i, j; \Theta^*) = P_i(\Theta^*)^{\mathsf T} P_j(\Theta^*), \qquad (10) $$

where the vector $P_i(\Theta^*) \in \mathbb{R}^N$ denotes

$$ P_i(\Theta^*) = \big( P(i \mid x_1, y_1, \Theta^*), \ldots, P(i \mid x_N, y_N, \Theta^*) \big)^{\mathsf T}. \qquad (11) $$

Note that the larger $J_{\mathrm{merge}}$ is, the higher the priority of the pair of components $i$ and $j$.
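In matrix form, Eq. (10) is an inner product of posterior columns, so all pairs can be scored at once; a brief sketch (the array name `P` is an assumption):

```python
import numpy as np

def merge_candidates(P):
    """P: (N, M) posteriors P(i | x_n, y_n, Theta*). Returns all pairs
    (i, j), i < j, sorted by J_merge of Eq. (10) in decreasing order."""
    M = P.shape[1]
    J = P.T @ P                                  # J[i, j] = P_i^T P_j
    pairs = [(i, j) for i in range(M) for j in range(i + 1, M)]
    return sorted(pairs, key=lambda p: -J[p])
```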
3.2.2. Split criterion
The split criterion Jsplit�k� is utilized to select a com-
ponent that does not have a well-estimated distribution. The
selected component is split into two components to get a
better estimation.
A modified Kullback–Leibler (KL) divergence is used as the split criterion in mixture distribution estimation [2]. For the same reason as in the previous section, we replace the estimated distribution $p(x \mid k, \Theta)$ and the empirical distribution $f_k(x)$ used for distribution estimation with $p(x, y \mid k, \Theta)$ and $f_k(x, y)$ for regression, respectively. That is, the split criterion is

$$ J_{\mathrm{split}}(k; \Theta^*) = \iint f_k(x, y) \log \frac{f_k(x, y)}{p(x, y \mid k, \Theta^*)} \, dx \, dy, \qquad (12) $$

where the empirical distribution $f_k(x, y)$ is given by

$$ f_k(x, y) = \frac{\sum_{n=1}^{N} \delta(x - x_n) \, \delta(y - y_n) \, P(k \mid x_n, y_n, \Theta^*)}{\sum_{n=1}^{N} P(k \mid x_n, y_n, \Theta^*)}. \qquad (13) $$

Note that the larger $J_{\mathrm{split}}(k)$ is, the higher the priority of component $k$.
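On a finite sample, the integral in Eq. (12) is naturally approximated by a sum over the data points, with $f_k$ of Eq. (13) carried by the normalized posteriors; in the sketch below, `P` and `pxy`, the model densities $p(x_n, y_n \mid k, \Theta^*)$, are assumed to be precomputed:

```python
import numpy as np

def split_scores(P, pxy, eps=1e-300):
    """P, pxy: (N, M) arrays. Returns J_split(k) of Eq. (12) for each k,
    approximated on the sample; eps guards the logarithm."""
    f = P / P.sum(axis=0, keepdims=True)   # f_k at the data points, Eq. (13)
    return (f * np.log((f + eps) / (pxy + eps))).sum(axis=0)
```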
3.2.3. Priority of split and merge candidates
The order of priority of the SM (split and merge)
operations strongly influences the calculation time in the
SMEM algorithm. Therefore, how to set the order of prior-
ity is very important. As described above, the split criterion
and the merge criterion are independently calculated, but
how can we set the order of the split and merge candidates
as a combination? Simple alternatives are the merge-first
method and split-first method.
The merge-first method is described below. First, select the merge candidate $\{i, j\}_c$ having the highest priority by $J_{\mathrm{merge}}(i, j)$. Next, select the split candidate $\{k\}_c$ having the highest priority among the remaining components by $J_{\mathrm{split}}(k)$, $k \neq i, j$. This process gives the order $c = 1, \ldots, M - 2$ to the SM candidates $\{i, j, k\}_c$. When more candidates are necessary, repeat the above.
The split-first method is described as follows. First, select the split candidate $\{k\}_c$ having the highest priority by $J_{\mathrm{split}}(k)$. Next, select the merge candidates $\{i, j\}_c$, $i, j \neq k$, from among the remaining components. This process gives the order $c = 1, \ldots, (M - 1)(M - 2)/2$ to the SM candidates $\{i, j, k\}_c$. When more candidates are necessary, repeat the above.
We choose the latter method because of the results of
the computational experiments described in Section 4.
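The split-first ordering can be written down directly; in the sketch below, `J_split` is a length-M array and `J_merge` an M × M array of the criteria of Sections 3.2.1 and 3.2.2 (names assumed, not from the paper):

```python
def split_first_candidates(J_split, J_merge):
    """Enumerate SM candidates {i, j, k}_c in split-first priority order."""
    M = len(J_split)
    order = []
    for k in sorted(range(M), key=lambda k: -J_split[k]):
        pairs = [(i, j) for i in range(M) for j in range(i + 1, M)
                 if k not in (i, j)]
        pairs.sort(key=lambda p: -J_merge[p[0]][p[1]])
        order.extend((i, j, k) for i, j in pairs)
    return order   # the first (M - 1)(M - 2)/2 entries share the top split k
```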
3.2.4. Learning in split and merge operations
In the SM operations, we create a new component $i'$ by merging components $i$ and $j$, and also create new components $j'$ and $k'$ by splitting component $k$. The initial parameter values of component $i'$ are set by replacing the posterior probability $P(i \mid x, y, \Theta)$ in Eqs. (6) to (9) with

$$ P(i' \mid x_n, y_n, \Theta^*) = P(i \mid x_n, y_n, \Theta^*) + P(j \mid x_n, y_n, \Theta^*). \qquad (14) $$
For components $j'$ and $k'$, the initial parameter values are set by adding small perturbations $\epsilon$ to the parameters of component $k$:

$$ m_m = m_k + \epsilon_m, \qquad \Sigma_m = \Sigma_k, \qquad (15) $$

$$ W_m = W_k + \epsilon'_m, \qquad S_m = S_k, \qquad (16) $$

where $m = j', k'$.
Reestimation of the parameters in the SM operations consists of two steps: a partial-EM step followed by a full-EM step. In the partial-EM step, we estimate the parameters of the three newly created components $i'$, $j'$, $k'$ without influencing the other components. Concretely, these parameters are renewed by Eqs. (6) to (9) with the posterior probability replaced by

$$ P(m \mid x_n, y_n, \Theta) = \frac{P(m \mid x_n, y_n, \Theta)}{\sum_{m' = i', j', k'} P(m' \mid x_n, y_n, \Theta)} \sum_{m' = i', j', k'} P(m' \mid x_n, y_n, \Theta^*), \qquad (17) $$

where $m = i', j', k'$. After the learning converges in the partial-EM step, the parameters of all components are reestimated by the usual EM algorithm; this is the full-EM step.
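The reweighting in Eq. (17) keeps the total responsibility of the three new components equal to the responsibility that $i$, $j$, and $k$ held under $\Theta^*$; a sketch with assumed array names:

```python
import numpy as np

def partial_em_posterior(P_cur, P_star, new_idx, old_idx):
    """P_cur: (N, M') posteriors under the current parameters;
    P_star: (N, M) posteriors under Theta*; new_idx = [i', j', k'],
    old_idx = [i, j, k]. Returns the renormalized block of Eq. (17)."""
    total = P_star[:, old_idx].sum(axis=1, keepdims=True)  # mass held by i, j, k
    block = P_cur[:, new_idx]
    return block / block.sum(axis=1, keepdims=True) * total
```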
4. Computational Experiments
We evaluate the effectiveness of the SMEM algorithm by both 3D surface estimation using artificial data and time series prediction using high-dimensional real data.
4.1. 3D surface estimation
We first show experiments of 3D surface estimation
using 2D input and 1D output data. Several examples of the
learning process and results using a five-component net-
work are shown in Fig. 1. The target function which trans-
forms 2D input into 1D output is plotted in Fig. 1(a).
For training, we prepared 1000 data points randomly selected in the input space and added 5% Gaussian noise to their output values. We evaluated the regression results at 21 × 21 points selected at equal intervals within the input space.
The regression results based on some initial parame-
ter values are shown in Fig. 1(b). (b-1) is the estimated
surface. In (b-2), the ovals show the input distributions
estimated by each component and the target function is
illustrated as the contour lines in the background.
Figure 1(c) shows a result obtained by the usual EM
algorithm. In this simulation, the learning converged after
172 iterations of the EM procedure. The estimated surface
(c-1) looks closer to the target than the reconstruction by
the initial parameter values (b-1). However, the input distributions estimated by each component do not fit well: the components would need nonlinear transformations to reconstruct the target surface, but each component is capable of only a linear transformation (c-2).
Intermediate results of the learning process are illustrated in Figs. 1(d) and 1(e). These figures show the results after 202 and 387 EM procedures, respectively, where the counts include the EM procedures rejected in Step 4 of the SMEM
algorithm. It is seen that the estimated surface gets closer
to the target as the learning proceeds.
Figure 1(f) shows the final result obtained by the
SMEM algorithm. The total number of EM procedures in
this learning is 580. (f-1) shows a better estimated surface
than (c), (d), or (e). The input distributions estimated by
each component are reasonable because the transformation
that the components should obtain is almost linear (f-2).
Table 1 compares the EM and SMEM algorithms. In the table, the statistics of 100 simula-
tions using random initial values are compared for the two
algorithms. The average (ave.), standard deviation (std.),
maximum (max.), and minimum (min.) values of the log-
likelihood, and calculation volume (i.e., total number of
EM procedures) are shown for each algorithm. The SMEM algorithm attains a better log-likelihood than the EM algorithm, but costs about 13 times more computation.
These results indicate that one simulation of the SMEM algorithm is roughly equal to 13 simulations of the EM algorithm in terms of calculation volume. It is therefore reasonable to compare the two algorithms at the same calculation volume, that is, to compare a single simulation by the SMEM algorithm with the best result of 13
simulations by the EM algorithm. Table 2 shows averages
and standard deviations of 30 simulations by the SMEM
algorithm and 30 sets of simulations in which each set is
the best result of 13 simulations by the EM algorithm. We
can confirm the advantage of the SMEM algorithm even for
the same calculation volume from these results.
The order of priority of SM candidates
We described two methods for determining the order
of priority of SM candidates in Section 3.2.3: the split-first
Fig. 1. An example of learning processes. (a) Target function; (b) estimation result from initial values; (c) result by the EM
algorithm; (d, e) results in processes of the SMEM algorithm; (f) final result by the SMEM algorithm.
Table 1. Comparison between two algorithms
method and the merge-first method. The following de-
scribes experiments investigating which of the two methods
is more efficient.
As previously mentioned, the SMEM algorithm rejects the estimate from an SM candidate if $Q^{**} \leq Q^*$. We therefore collected statistics on the earliest priority order at which an SM candidate was accepted in each SM operation. Figure 2 shows the results of 100 simulations. The acceptance ratio is plotted against the order of priority for (a) the split-first method and (b) the merge-first method. The earlier in the order the accepted candidate appears, the fewer SM operations the SMEM algorithm performs, and hence the more efficient it is.
In the split-first method, an earlier order has a higher acceptance ratio (a). On the other hand, the relationship between the order of priority and the acceptance ratio is not clear in the merge-first method, which means that more SM operations are performed uselessly (b). These results are also evident in (c), which shows the cumulative number of acceptances for each method. The split-first method has larger values over all orders than the merge-first method; that is, the split-first method is more efficient.
4.2. Time series prediction
We discuss time series prediction using high-dimen-
sional real data in this section. We use one-dimensional time series data observed from a far-infrared laser [7].* We consider any 25 successive data points as a high-dimensional input vector and estimate the data point that follows them.

The 25-dimensional vector regarded as the input, and the corresponding output, are given by

$$ x_n = (s_n, s_{n+1}, \ldots, s_{n+24})^{\mathsf T}, \qquad y_n = s_{n+25}, \qquad (18) $$

where $s_t$ denotes the observed series.
Here, the covariance matrix $\Sigma_i$ calculated according to Eq. (7) is not always a regular (nonsingular) matrix, because of both the sparseness of the input distribution and the limited numerical precision of the high-dimensional calculation. In the experiments, therefore, we used covariance matrices regularized with very small values $\epsilon$ and $\epsilon'$.
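The windowing of Eq. (18) can be sketched as follows; `series` stands for the one-dimensional laser signal (an assumed name):

```python
import numpy as np

def make_windows(series, d=25):
    """Each input is d successive samples; the output is the next sample."""
    X = np.array([series[n:n + d] for n in range(len(series) - d)])
    y = np.asarray(series[d:])
    return X, y      # e.g., 1000 raw samples yield 975 (input, output) pairs
```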
The observed one-dimensional data are plotted in Fig. 3(a). In the experiments, we regarded each set of 25 successive data points as the input and the following data point as the output.
Table 2. Comparison with the same calculation volume
Fig. 2. Comparison between two types of computations
determining the order of priority of SM candidates. (a)
Acceptance ratios of each order by the split-first
method; (b) acceptance ratios of each order by
the merge-first method; (c) comparison of the
number of accumulations between the two
methods.
*These data were used in the Santa Fe Institute Time Series Prediction and Analysis Competition.
For training, 975 data sets were prepared from
the data. We performed two experiments: estimation of the next data point from each input, and iterative estimation starting from an initial input vector. Figure 3(b) illustrates an example of the results using a 50-component network. The solid line shows the target data, the thick dashed line is the one-step estimate of the next data point from each input, and the thin dashed line is the result estimated iteratively from the first 25 data points.
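The two prediction modes can be sketched as follows; `predict` stands for the trained network's input-output map and is an assumed callable:

```python
import numpy as np

def one_step_predictions(predict, series, d=25):
    """Thick dashed line: predict the next point from each true window."""
    return np.array([predict(np.asarray(series[n:n + d]))
                     for n in range(len(series) - d)])

def iterative_predictions(predict, seed, steps, d=25):
    """Thin dashed line: feed each estimate back into the input window."""
    buf = list(seed[:d])             # start from the first d true points
    for _ in range(steps):
        buf.append(predict(np.asarray(buf[-d:])))
    return np.array(buf[d:])
```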
For the estimation of the following data point, Table 3
compares the statistics of the results of 10 simulations by the
SMEM algorithm and the EM algorithm in both 10- and
50-component networks. The statistics of log-likelihood val-
ues are also shown. All of these results indicate that the SMEM
algorithm has a clear advantage over the EM algorithm.
Comparisons with other algorithms are shown in
Table 4. These values are statistics of squared errors of
estimations. Here, MLP denotes a three-layer network with ten hidden units trained by the minimum-squared-error method, and MDL denotes the regularization method proposed by Saito and Nakano [8]. No significant
differences can be seen between the SMEM algorithm
using a 50-component network and MDL.
As described above, the SMEM algorithm estimates the next data point well from each input. On the other hand, the results of iterative estimation show large errors: the average and standard deviation of the squared errors are 0.20 and 0.234e-4, respectively, for a 50-component network. These results indicate that the NGnet model is unable to estimate such unsteady dynamic data.
5. Discussion
5.1. Distribution estimation and regression
with joint probability
We can regard the complete data $z_n = (x_n, y_n)$ as samples from a distribution and perform distribution estimation.
Fig. 3. An experiment using real data. (a) Training data; (b) test data (solid line) and estimation results by the SMEM
algorithm (dashed lines).
Table 3. Results based on real data
Table 4. Comparison with other methods
What, then, is the difference between distribution estimation and regression under the assumption of joint probability? We discuss this question below.
As far as mixture distribution estimation is concerned, the likelihood of the complete data $z_n = (x_n, y_n)$ is written

$$ p(z_n \mid \Theta) = \sum_{i=1}^{M} P(i) \, G_i(z_n \mid m_i, \Sigma_i). \qquad (19) $$

The unknown parameters in this equation are the mean vector and covariance matrix of the complete data in each component: $m_i \in \mathbb{R}^{d_x + d_y}$ and $\Sigma_i \in \mathbb{R}^{(d_x + d_y) \times (d_x + d_y)}$, respectively. On the other hand, the likelihood function of the complete data in regression estimation is, as described in Section 2, given by

$$ p(x_n, y_n \mid \Theta) = \sum_{i=1}^{M} P(i) \, p(y_n \mid x_n, i, W_i, S_i) \, p(x_n \mid i, m_i, \Sigma_i). \qquad (20) $$
The differences between the two equations are the first factor on the right-hand side of Eq. (20), namely $p(y_n \mid x_n, i, W_i, S_i)$, and, more importantly, the newly added unknown parameters $W_i$ and $S_i$, which denote the transformation from the input to the output and the covariance matrix of the output in each component, respectively. That is, the difference between the two types of estimation lies in which parameters are to be estimated.
In regression using the NGnet in particular, the unknown parameters are $m_i \in \mathbb{R}^{d_x}$, $\Sigma_i \in \mathbb{R}^{d_x \times d_x}$, $W_i \in \mathbb{R}^{(d_x + 1) \times d_y}$, and $S_i \in \mathbb{R}^{d_y \times d_y}$. That is, the number of unknown parameters is smaller than in the distribution estimation described by Eq. (19).
5.2. Merit of joint probability for regression
The number of unknown parameters is generally
larger in a model of joint probability than in a model
regarding only the output as a probability. This is because
the input distribution also needs to be estimated. The esti-
mation accuracy by the model of joint probability, there-
fore, is worse when the number of data sets is not much
larger than the number of unknown parameters.
With the NGnet, however, the number of unknown parameters is the same for the joint probability model and the output probability model. This is because even the output probability model must estimate the mean $m_i$ and covariance matrix $\Sigma_i$ of the input, which are the parameters of the Gaussian functions in the gating network. The joint probability model is therefore considered more effective in the NGnet than the output probability model, because its optimization can be carried out analytically.
To clarify this difference in effectiveness between the two models, we compared calculation times and squared errors for regression using a three-component network. The target function was set to $y = 10 \tanh(x)$, $-10 < x < 10$.
Table 5 shows statistics of ten simulations using
random initial parameter values. The average calculation
time for the model of joint probability is about 1/25 that for
the model of output probability. The average squared error
is also smaller in the former model. These results show the
effect of the analytical scheme for optimization in the
former model. The standard deviation is also smaller in the
former model, which means that the former model is less
influenced by the initial parameter values. Taken together,
these results indicate that the joint probability model gives
us fast, accurate, and stable learning in the NGnet.
6. Conclusion
We showed the effectiveness of the SMEM algorithm
for mixture regression problems using the NGnet. We also
clarified what should be changed in the split and merge operations and how to order the split and merge candidates. The results of computational experiments using
artificial and real data indicated that the estimation accuracy
by the SMEM algorithm is better than that by the EM
algorithm for regression problems, and also that the as-
sumption of joint probability provides fast, accurate, and
stable learning.
REFERENCES
1. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 1977;39:1–38.
2. Ueda N, Nakano R, Ghahramani Z, Hinton GE. SMEM algorithm for mixture models. Neural Comput 2000;12:2109–2128.
Table 5. Effects of assumption of joint distribution
3. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Comput 1991;3:79–87.
4. Jordan MI, Jacobs RA. Hierarchical mixtures of experts and the EM algorithm. Neural Comput 1994;6:181–214.
5. Xu L, Jordan MI, Hinton GE. An alternative model for mixtures of experts. In: Advances in Neural Information Processing Systems 7. MIT Press; 1995. p 633–640.
6. Ishii S, Sato M. Normalized Gaussian network: mixture of experts and EM algorithm. JNNS 1999;6:30–40. (in Japanese)
7. Weigend AS, Gershenfeld NA. Time series prediction: Forecasting the future and understanding the past. Addison-Wesley; 1993.
8. Saito K, Nakano R. A new regularization based on the MDL principle. JSAI 1998;13:123–130.
AUTHORS (from left to right)
Satoshi Suzuki (member) received his B.S. degree in basic sciences from the University of Tokyo in 1990 and then joined
Nippon Telegraph and Telephone Corporation (NTT) Laboratories. Currently, he is a research scientist at NTT Communication
Science Laboratories. From 1992 to 1997, he was a research scientist at ATR Human Information Processing Research
Laboratories. He received the Japanese Neural Network Society (JNNS) Research Award in 1994. He is a member of JNNS and
IEICE.
Naonori Ueda (member) received his B.S., M.S., and Ph.D. degrees in communication engineering from Osaka University
in 1982, 1984, and 1992. In 1984, he joined the Electrical Communication Laboratories, Nippon Telegraph and Telephone
Corporation (NTT). Currently, he is a senior research scientist (distinguished technical member) at NTT Communication Science
Laboratories, and a guest associate professor at the Nara Institute of Science and Technology. He was a visiting scholar at Purdue University in 1994–1995. He received the Japanese Neural Network Society (JNNS) Research Award, the 12th
Telecommunication Advancement Foundation Award (Telecom System Technology prize), and a paper award from IEICE in
1995, 1997, and 2000, respectively. He is a member of JNNS, the Japan Statistical Society, and IEEE.