
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x

PAPER

Multiphase Learning for an Interval-Based Hybrid

Dynamical System

Hiroaki KAWASHIMA†a) and Takashi MATSUYAMA†, Members

SUMMARY This paper addresses the parameter estimation problem of an interval-based hybrid dynamical system (interval system). The interval system has a two-layer architecture that comprises a finite state automaton and multiple linear dynamical systems. The automaton controls the activation timing of the dynamical systems based on a stochastic transition model between intervals. Thus, the interval system can generate and analyze complex multivariate sequences that consist of temporal regimes of dynamic primitives. Although the interval system is a powerful model to represent human behaviors such as gestures and facial expressions, the learning process has a paradoxical nature: temporal segmentation of primitives and identification of constituent dynamical systems need to be solved simultaneously. To overcome this problem, we propose a multiphase parameter estimation method that consists of a bottom-up clustering phase of linear dynamical systems and a refinement phase of all the system parameters. Experimental results show the method can organize hidden dynamical systems behind the training data and refine the system parameters successfully.
key words: hybrid dynamical system, interval transition, system identification, clustering of dynamical systems, expectation-maximization algorithm

1. Introduction

Understanding the meaning of human commands and presenting suitable information to users is one of the primary objectives of man-machine interaction systems. Most of the previously proposed approaches, therefore, set the goal of classifying verbal categories of dynamical events such as spoken words and gestures. Although this goal is still important, users sometimes feel frustration when the system deviates from human protocols. Many of the systems utilize only the temporal order of events to estimate user intentions rather than temporal features such as acceleration patterns, tempo, and rhythms, which convey rich information about users' internal states.

To recognize the detailed meaning of human activities and to generate actions with appropriate timing, we concentrate on modeling temporal structures in verbal and nonverbal events based on the relation among intervals. Firstly, we assume that a complex action consists of dynamic primitives, which are often referred to as motion primitives [1], movemes [2], visemes [3],

Manuscript received April 4, 2005. Manuscript revised June 10, 2005. Final manuscript received July 13, 2005.
†The authors are with the Graduate School of Informatics, Kyoto University, Kyoto-shi, 606-8501 Japan.
a) E-mail: [email protected]

motion textons [4], and so on. For example, a cyclic lip motion can be described by simple lip motions such as “open”, “close”, and “remain closed”. Once the set of dynamic primitives is determined, a complex action can be partitioned into temporal intervals with the labels of the primitives. Secondly, we assume that not only the temporal order but also the duration of the intervals carries significant information in human communication. For instance, Kamachi et al. [5] suggested that the duration of facial actions plays an important role in human judgments of basic facial expression categories.

In this paper, we introduce an interval-based hybrid dynamical system (interval system) to describe complex events based on the structure of intervals. The system has a two-layer architecture that consists of a finite state automaton and a set of multiple linear dynamical systems. Each linear dynamical system represents a dynamic primitive that corresponds to a discrete state of the automaton; meanwhile, the automaton controls the activation timing of the dynamical systems. Thus, the interval system can generate and analyze complex multivariate sequences that consist of temporal regimes of dynamic primitives (see Fig. 1 for an example).

In spite of the flexibility of the interval system, especially for modeling human behaviors such as gestures and facial expressions, few applications have exploited the system to handle real-world signals. The reason is largely due to the paradoxical nature of the learning process: the temporal segmentation and system identification problems need to be solved simultaneously.

The main contribution of this paper is a multiphase parameter estimation method that consists of (1) an agglomerative clustering process of dynamics based on a distance definition among dynamical systems, and (2) a refinement process of all the system parameters, including the dynamical systems and the finite state automaton, based on an approximated expectation-maximization (EM) algorithm.

In Section 2, we discuss the characteristics of our proposed method compared to related works. Section 3 describes the detailed model structure and an inference algorithm that searches for the optimal interval sequence, that is, the one that provides the highest probability for a given observation. In Section 4, we verify the inference algorithm using simulated data. Section 5 proposes the multiphase learning method to identify the interval system. We evaluate the method using simulated and real


Fig. 1 Example of a generated lip image sequence from the interval system. Each of the three linear dynamical systems models a simple lip motion, and the automaton generates the temporal order and duration length of each motion.

data in Section 6. Finally, Section 7 concludes with a discussion of open problems.

2. Related Works

In this section, we introduce related works to explain the characteristics of our proposed method in two aspects: the system definition and its identification.

2.1 Clock-Based and Event-Based State Transition

Previously proposed state transition models can be categorized into clock-based and event-based models. Clock-based models define the state transition in the physical-time domain by the formulation of differential or difference equations. For instance, dynamical systems in control theory [6] and recurrent neural networks are clock-based models. In contrast, an event-based model has a sustainable state, and it does not change the state before an input event occurs; for example, finite state automata and Petri nets are event-based models.

Clock-based models have the advantage of modeling continuously changing events such as human motion and utterance; however, they have the disadvantages of being unable to model the duration of a state or to describe temporal structures of concurrent events such as lip motion and speech. Event-based models have the advantage of being able to model logical structures such as relations of temporal order, overlap, and inclusion among events [7]. However, event-based models have typical signal-to-symbol problems; that is, they require us to define an event set in advance.

2.2 Hybrid Dynamical Systems

Hybrid dynamical systems (hybrid systems), which integrate clock-based and event-based models, were introduced to overcome the disadvantages of the two models in a complementary manner. In a hybrid system, an event-based model decides the activation timing of multiple clock-based models. The event-based model provides a solution to represent a logical structure of dynamic primitives; meanwhile, the clock-based models represent detailed dynamics in each primitive and also provide a distance definition among the primitives. Because of their high capability of modeling nonlinear and complicated events, hybrid systems are currently attracting great attention in various fields including control, computer graphics, computer science, and neural networks.

A switching linear dynamical system (SLDS), which switches linear dynamical systems based on the state transition of a hidden Markov model (HMM) [8], has become a common approach to modeling complex dynamics such as human motion [9]–[11]. In an SLDS, the transition between constituent subsystems (i.e., linear dynamical systems) is modeled in the same time domain as the subsystems. Assuming that the system consists of a subsystem set Q = {q_1, ..., q_N}, the SLDS models the transition from subsystem q_i to q_j as a conditional probability P(s_t = q_j | s_{t−1} = q_i), where t is synchronized to the continuous state transition in each subsystem.

In contrast, some hybrid systems model the transition between subsystems independently of the physical-time domain. Thus, the conditional probability becomes P(s_k = q_j | s_{k−1} = q_i), where k represents only the temporal order of the transitions. Bregler et al. [2] proposed a multilevel modeling method of human gait motion. The model comprises multiple linear dynamical systems as its subsystems, and an HMM switches the constituent subsystems. Stochastic linear hybrid systems [4], [12] are extensions of Bregler's model (e.g., motion texture [4] is proposed for motion generation purposes). Segmental models (SMs) [13] have been proposed in the speech recognition field as a unified model of segmental HMMs [8], segmental dynamical systems, and other segment-based models. This model uses the segment as a descriptor, which has a duration distribution. A segment represents a temporal region in which one of the states is activated, and the total system represents phonemes and subwords as a sequence of segments.

Piecewise linear (PWL) models and piecewise affine systems [14]–[16] are also categorized in the class of hybrid systems. A PWL model constructs a nonlinear map f(x) by partitioning the domain X ⊂ R^n into several regions X_1, ..., X_N with polyhedral boundaries, which are referred to as guardlines. In each region, a linear mapping function is defined individually, and the functions are switched by the condition x ∈ X_i. As a result, the mapping function becomes nonlinear as a whole. From the viewpoint of switching condition definitions, we can say that the class of PWL models has static switching conditions, because the switching timing is decided only by the predetermined regions. On the other hand, the hybrid systems described in the previous paragraphs have dynamic switching conditions:


constituent subsystems are switched based on the state transition of the event-based model (e.g., an automaton). In this paper, we concentrate on exploring hybrid systems that have dynamic switching conditions.
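To make the contrast concrete, the following sketch implements a one-dimensional PWL map with static switching conditions; the boundaries and affine coefficients are illustrative choices, not values from the paper.

```python
import numpy as np

# A minimal PWL map sketch (values are illustrative, not from the paper).
# The domain R is partitioned by fixed boundaries ("guardlines" in the
# 1-D case are simple thresholds); each region owns its own affine map.
boundaries = [0.0, 1.0]  # partitions R into (-inf, 0), [0, 1), [1, inf)
maps = [(2.0, 0.5), (-1.0, 0.5), (0.3, -0.7)]  # (slope a_i, offset b_i) per region

def pwl(x):
    """Apply the affine map of the region containing x (static switching)."""
    i = np.searchsorted(boundaries, x, side="right")  # region index of x
    a, b = maps[i]
    return a * x + b

# Each piece is linear, but the overall map f(x) is nonlinear.
print(pwl(-2.0), pwl(0.5), pwl(3.0))
```

The switching here depends only on where x lies, never on the history of the system; a dynamic switching condition would instead consult the state of an automaton.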

2.3 Interval-Based Hybrid Dynamical Systems

An interval-based hybrid dynamical system (interval system) proposed in this paper can be regarded as an extension of SMs. We use the term “intervals” instead of segments because our motivation is to bring Allen's temporal interval logic [7], which exploits 13 topological relations between two intervals (e.g., meets, during, starts with, etc.), into the class of hybrid systems.

The difference from SMs is that the interval system models not only the interval duration but also the relation among multiple intervals. Once the intervals are explicitly defined, we can fabricate more flexible models to represent the complex and concurrent dynamics appearing in man-machine interaction; for example, we can describe the temporal relation among concurrent motions appearing in multiple human parts based on the intervals [17]. In this paper, we concentrate on modeling a single signal from a single source as the simplest case, and describe the relation of adjacent intervals based on the correlation of their duration lengths.

Another advantage of explicit modeling of interval relations and duration lengths is that it enhances robustness against outliers during the temporal partitioning process; in other words, the meta-level knowledge works as a constraint on the lower-level process. For instance, if the duration distributions of the subsystems are biased toward a long length, the system will not change the subsystem before the activation of the current subsystem has been sustained long enough. As a result, the system improves the robustness of the temporal segmentation, which is a desirable property for the iterative parameter estimation described in the next subsection.

2.4 Identification of Hybrid Dynamical Systems

As we described in Section 1, the difficulty of the learning process that identifies a hybrid system lies in its paradoxical nature. Let us assume that a large amount of multivariate sequences is given as training data. Then, the identification of the subsystems requires partitioned and labeled training data; meanwhile, the segmentation and labeling processes of the training data require an identified set of subsystems. Moreover, the number of subsystems is also unknown in general. Therefore, the parameter estimation problem requires us to simultaneously estimate the temporal partitioning of the training data (i.e., segmentation and labeling) and the set of subsystems (i.e., the number of subsystems and their parameters).

In the following paragraphs, we introduce relevant works on the identification of hybrid systems. We distinguish the cases in which the number of subsystems is known or unknown.

(1) The number of subsystems is known:

The EM algorithm [18] is one of the most common approaches to solve this kind of paradoxical problem when the number of subsystems is given. The algorithm estimates parameters based on an iterative calculation. In each step, the algorithm fits the model to the training data using the model parameters updated in the previous step. Then, the parameters are updated based on the result of the current model fitting process.

Many of the hybrid system identification methods exploit the EM algorithm; for example, SMs [13], motion textures [4], stochastic linear hybrid systems [12] (which follows the method in [4]), and SLDSs [9]–[11], [19] apply this algorithm. A well-known identification technique for PWL models is introduced as k-means-like clustering [14]. This clustering technique can be regarded as an approximation (i.e., hard clustering) of the EM algorithm, which is often referred to as soft clustering [20].

Although the convergence of the EM algorithm is guaranteed, it strongly depends on the selection of initial parameters and often converges to a locally optimal solution, especially if the model has a large parameter space to search. As an alternative approach, the identification problem of PWL models can be recast as a mixed integer programming (MIP) problem, which finds the globally optimal solution [15], [16]. We can apply this method when the logical switching conditions between subsystems can be transformed into a set of inequalities; however, it is difficult to transform dynamic switching conditions, which are modeled by an automaton. Another disadvantage of MIP lies in its computational complexity: the problem is well known to be NP-hard in the worst case [14], [15]. For these reasons, we exploit the EM algorithm rather than the MIP-based methods to identify the interval system.

(2) The number of subsystems is unknown:

Most previous works on hybrid system identification assume that the number of subsystems is given, because the number often corresponds to that of manually defined operative conditions in the controlled object. In contrast, a set of dynamic primitives in human motion (e.g., primitive motions appearing in facial expressions) is undefined in most cases. Whereas some heuristic sets of dynamic primitives have been defined by human observation, there is no guarantee that such a set is appropriate for man-machine interaction systems (e.g., automatic recognition of facial motion patterns). Hence, we should estimate the number of subsystems (primitives) from a given training data set. This problem is often referred to as the cluster validation problem [20] (see Subsection 5.1.4 for details).

In stochastic linear hybrid systems [4], [12], an online clustering based on a greedy algorithm is applied to determine the number of subsystems (i.e., linear dynamical systems) and to initialize the EM algorithm. The drawback is that the result depends on the order of data presentation and is sensitive to outliers in the training data. Moreover, the algorithm requires appropriate thresholds to be decided beforehand. As a result, the algorithm tends to return too many subsystems.

2.5 Multiphase Learning for the Interval Systems

As we described in the previous subsection, the identification of interval systems requires solving the following problems:

1. initialization of the EM algorithm
2. estimation of the number of subsystems

In this paper, we propose a multiphase learning method that solves the problems above. The key idea is that we estimate the initial parameter set and the number of subsystems simultaneously based on a bottom-up (agglomerative) hierarchical clustering technique, which we propose in this paper (see Subsection 5.1 for details). This clustering algorithm is applied to a comparatively small number of typical training samples selected from the given training set. In the second phase, we exploit the EM algorithm as a refinement process for all the system parameters, including the parameters of the automaton. This process is applied to all the given training data, whereas the clustering process is applied only to the selected typical training set. Owing to the parameters estimated in the clustering process, the EM algorithm can be initialized by a parameter set that is relatively close to the optimum compared to randomly selected parameters (see Subsection 5.2 for details).

3. Interval-Based Hybrid Dynamical Systems

3.1 System Architecture

An interval system is a stochastic generative model that can generate multivariate vector sequences with interval-based structures. Each generated sequence can be partitioned by temporally ordered intervals, and each interval represents a single linear dynamics (see Fig. 2).

The system has a two-layer architecture. The first layer has a finite state automaton as an event-based model that models stochastic transitions between intervals. The automaton can generate interval sequences, in which each interval is labeled by the combination of a state of the automaton and a duration. The second layer consists of a set of multiple linear dynamical systems D = {D_1, ..., D_N} as clock-based models. Each dynamical system can generate multivariate vector sequences based on its linear dynamics (see Subsection 3.2).

Fig. 2 Multivariate sequence generation by an interval-based hybrid dynamical system.

We define some terms and notations for the interval system in the following paragraphs. We simply use the term “dynamical systems” to denote linear dynamical systems.

Observation: Each dynamical system generates observation sequences of multivariate vectors y ∈ R^m in an m-dimensional observation space.

Continuous states: All the constituent dynamical systems are assumed to share an n-dimensional continuous state space. Each activated dynamical system can generate sequences of continuous (real-valued) state vectors x ∈ R^n, which are mapped to the observation space by a linear function that is also shared by all the dynamical systems.

Discrete states: The finite state automaton has a discrete state set Q = {q_1, ..., q_N}. The state q_i ∈ Q corresponds to the dynamical system D_i.

Duration of discrete states: The duration for which the automaton sustains the same discrete state is a positive integer, because the interval system is a discrete-time model. To reduce the parameter size, we set a minimum duration l_min and a maximum duration l_max; therefore, the duration is defined by τ ∈ T ≜ {l_min, ..., l_max}.

Intervals: The intervals generated by the automaton are defined as a combination of a discrete state and a duration length. We use the notation <q_i, τ> ∈ Q × T to represent the interval that has state q_i and duration τ.
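The notation above can be summarized in a small sketch; the class name, duration bounds, and example sequence below are hypothetical illustrations, not part of the paper's formulation.

```python
from dataclasses import dataclass

# Hypothetical container mirroring the notation: an interval <q_i, tau>
# pairs a discrete state with a duration in T = {l_min, ..., l_max}.
L_MIN, L_MAX = 2, 40  # assumed duration bounds (illustrative values)

@dataclass(frozen=True)
class Interval:
    state: int  # index i of discrete state q_i / dynamical system D_i
    tau: int    # duration length in frames

    def __post_init__(self):
        if not (L_MIN <= self.tau <= L_MAX):
            raise ValueError("duration must lie in T = {l_min, ..., l_max}")

# An interval sequence I = I_1, ..., I_K with no gaps or overlaps is then
# an ordered list; its total length is the sum of the durations.
seq = [Interval(0, 5), Interval(1, 3), Interval(0, 7)]
total = sum(iv.tau for iv in seq)
```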


3.2 Linear Dynamical Systems

The state transition in the continuous state space by dynamical system D_i and the mapping from the continuous state space to the observation space are modeled as follows:

x_t = F^(i) x_{t−1} + g^(i) + ω_t^(i)

y_t = H x_t + υ_t,

where F^(i) is a transition matrix and g^(i) is a bias vector. Note that each dynamical system has F^(i) and g^(i) individually. H is an observation matrix that defines a linear projection from the continuous state space to the observation space. ω_t^(i) and υ_t are the process noise and the observation noise, which have Gaussian distributions N(0, Q^(i)) and N(0, R), respectively. We use the notation N(a, B) to denote a Gaussian distribution with mean vector a and covariance matrix B.

We assumed that all the dynamical systems share a single continuous state space. The main reason is that we want to reduce the parameters in the interval system; it is, however, possible to design the system with an individual continuous state space for each dynamical system. In such cases, observation parameters H^(i) and R^(i) are required for each dynamical system. Although they provide more flexibility in modeling, a large parameter space causes problems such as overfitting and high calculation costs.

Using the formulation and notation mentioned above, we can consider the following probability distribution functions:

p(x_t | x_{t−1}, s_t = q_i) = N(F^(i) x_{t−1} + g^(i), Q^(i))

p(y_t | x_t, s_t = q_i) = N(H x_t, R),

where the probability variable s_t is the activated discrete state at time t (i.e., s_t = q_i denotes that dynamical system D_i is activated).

Let us assume that the continuous state has a Gaussian distribution at each time t. Then, the transition of the continuous state becomes a Gauss–Markov process, which is inferable in the same manner as Kalman filtering [6]. Thus, the predicted state distribution under the condition of the observations from 1 to t−1 is calculated as follows:

p(x_t | y_1^{t−1}, s_t = q_i) = N(x_{t|t−1}^(i), V_{t|t−1}^(i))

p(y_t | y_1^{t−1}, s_t = q_i) = N(H x_{t|t−1}^(i), H V_{t|t−1}^(i) H^T + R),

where the mean vector x_{t|t−1}^(i) and the covariance matrix V_{t|t−1}^(i) are updated at every sampled time t.

Probability calculation: Suppose that the dynamical system D_j represents an observation sequence y_{t−τ+1}^t ≜ y_{t−τ+1}, ..., y_t, which has a duration length τ. Then the likelihood score that the system D_j generates the sequence is calculated by the following equation:

d_{[t−τ+1,t]}^(j) ≜ P(y_{t−τ+1}^t | <q_j, τ>) = ∏_{t′=t−τ+1}^{t} γ^m p(y_{t′} | y_1^{t′−1}, s_{t′} = q_j),   (1)

where we assume a Gaussian distribution N(x_init^(j), V_init^(j)) for the initial state distribution in each interval represented by dynamical system D_j. γ^m is a volume size of observations in the observation space that converts probability density values to probabilities; here m is the dimensionality of the observation vectors, and γ is assumed to be provided manually based on the size of the observation space (the range of each element in the observation vectors). This likelihood score is used in Subsection 3.4 to evaluate the fitting of interval sequences to given multivariate sequences.

3.3 Interval Transitions of the Automaton

Suppose that the automaton generates an interval sequence I = I_1, ..., I_K, in which adjacent intervals have no temporal gaps or overlaps. We assume a first-order Markov property for the generated intervals; that is, the interval I_k depends only on the previous interval I_{k−1}. Therefore, the automaton models not only the transition of discrete states but also the correlation between adjacent interval duration lengths.

We model the Markov process of intervals as the following conditional probability:

P(I_k = <q_j, τ> | I_{k−1} = <q_i, τ_p>),

where it denotes that the interval <q_j, τ> occurs after the interval <q_i, τ_p> (see Fig. 3). Thus, the probability of the interval <q_j, τ> can be calculated using all the possible intervals that may have occurred just before it:

P(I_k = <q_j, τ>) = Σ_{<q_i, τ_p> ∈ Q×T} P(I_k = <q_j, τ> | I_{k−1} = <q_i, τ_p>) P(I_{k−1} = <q_i, τ_p>),

where the summation over <q_i, τ_p> does not include q_j (i.e., there are no self loops such as q_j → q_j).

The probability P(I_k = <q_j, τ> | I_{k−1} = <q_i, τ_p>) requires a large parameter set, which causes not only high calculation costs but also the problem of overfitting during the training phase. We therefore use a parametric model for the duration distribution:

h^(ij)(l_k, l_{k−1}) ≜ P(l_k, l_{k−1} | s_k = q_j, s_{k−1} = q_i),

where the two-dimensional distribution models a joint probability density function of the duration lengths in adjacent interval pairs that have states q_i and q_j in this

Fig. 3 First-order Markov process is assumed for a sequence of intervals.

order. s_k and l_k are probability variables of the discrete state and the duration length in the interval I_k, respectively. Note that we can assume an arbitrary density function for h^(ij)(l_k, l_{k−1}). For convenience, we use a two-dimensional Gaussian distribution normalized in the range [l_min, l_max]; thus, the parameter set of the function h^(ij)(l_k, l_{k−1}) becomes {h_m^(i), h_v^(i), h_c^(ij)}, where h_m^(i) and h_v^(i) denote the mean and variance of the duration lengths in discrete state q_i, and h_c^(ij) denotes the covariance between the adjacent duration lengths in discrete state sequences q_i q_j (i ≠ j).

Using the above notation and assuming that the current discrete state is independent of the duration of the previous interval, we can calculate the interval transition probability as

P(I_k = <q_j, τ> | I_{k-1} = <q_i, τ_p>) = h̄^(ij)(τ, τ_p) A_ij,

where h̄^(ij)(l_k, l_{k-1}) is the one-dimensional conditional (Gaussian) distribution

h̄^(ij)(l_k, l_{k-1}) ≜ P(l_k | s_k = q_j, s_{k-1} = q_i, l_{k-1}) = h^(ij)(l_k, l_{k-1}) / Σ_{l_k} h^(ij)(l_k, l_{k-1}),

and A_ij is the discrete state transition probability

A_ij ≜ P(s_k = q_j | s_{k-1} = q_i),

where i ≠ j. Note that, in conventional discrete state models such as HMMs and SLDSs, the diagonal elements of the matrix [A_ij] define the probabilities of self loops. In the interval system, the diagonal elements are separated from the matrix and modeled as duration distributions. As a result, the balance between diagonal and non-diagonal elements varies with the duration length of the current state.
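To make the duration model concrete, the following numerical sketch (ours, not the authors' code; the function name and argument order are our own) builds the joint duration table h^(ij) as a two-dimensional Gaussian normalized over the discrete range [l_min, l_max], and derives the conditional h̄^(ij) by normalizing over the current duration l_k:

```python
import numpy as np

def duration_tables(h_m_i, h_v_i, h_m_j, h_v_j, h_c_ij, l_min, l_max):
    """Build the joint duration table h[l_k, l_{k-1}] for the state pair
    (q_i -> q_j) as a 2-D Gaussian normalized over [l_min, l_max], and the
    conditional hbar[l_k, l_{k-1}] = P(l_k | l_{k-1}) obtained by
    normalizing each column over l_k."""
    ls = np.arange(l_min, l_max + 1)
    lk, lp = np.meshgrid(ls, ls, indexing="ij")   # current / previous duration
    d = np.stack([lk, lp], axis=-1) - np.array([h_m_j, h_m_i])
    prec = np.linalg.inv(np.array([[h_v_j, h_c_ij], [h_c_ij, h_v_i]]))
    h = np.exp(-0.5 * np.einsum("...a,ab,...b->...", d, prec, d))
    h /= h.sum()                                  # normalize over the range
    hbar = h / h.sum(axis=0, keepdims=True)       # condition on l_{k-1}
    return h, hbar
```

For instance, with the q1 → q2 parameters used later in Section 4 (means {6, 12}, variances {5, 30}, covariance 12, range [1, 30]), the call duration_tables(6, 5, 12, 30, 12, 1, 30) returns a 30×30 joint table whose entries sum to 1 and a conditional table whose columns each sum to 1.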

3.4 Inference of the Interval System

This section describes a probabilistic inference method that searches for the optimal interval sequence to represent an input multivariate sequence. The method assumes that the interval system has been trained beforehand. As we will see in the following paragraphs, the method recursively finds the intervals that provide the highest likelihood score with respect to the input, by generating all the possible intervals and selecting the optimal interval sets at every time t based on a dynamic programming technique. As a result, the input sequence is partitioned and labeled by discrete states that determine the most likely dynamical system to represent the multivariate sequence in each interval. In other words, the inference is a model fitting process that fits intervals to the given multivariate sequence. The likelihood of the trained model with respect to the input sequence is obtained simultaneously, as the score of the fitting precision. This inference process is required in the EM algorithm of the interval system identification, as we will see in Subsection 5.2.

The most naive search method would first calculate the likelihood scores of the model for all the possible interval sequences independently, and then find the best interval sequence, the one that provides the largest likelihood. However, the calculation cost in this case is of order O(N^T). To avoid unnecessary calculation, we exploit a recursive calculation similar to that of HMMs.

Let us first consider the forward algorithm of the interval system. Suppose that the multivariate input data y have been observed from time 1 to t, and that the interval I_k = <q_j, τ> ends at time t. Considering all the possible intervals that occur just before this interval, we can decompose the joint probability P(I_k = <q_j, τ>, y_1^t) into the following recursive equation:

P(I_t = <q_j, τ>, y_1^t)
  = Σ_{<q_i, τ_p> ∈ Q×T} { P(I_t = <q_j, τ> | I_{t-τ} = <q_i, τ_p>) × P(I_{t-τ} = <q_i, τ_p>, y_1^{t-τ}) } × p(y_{t-τ+1}^t | I_t = <q_j, τ>),

where the subscript t in the interval I_t denotes the ending time of the interval. Calculating this joint probability at every sampled time t, we obtain the probabilities of all the possible interval sequences.

The forward algorithm often causes numerical underflow when the input sequence becomes long. In the following paragraphs, we describe the Viterbi algorithm, in which we can take the logarithm of the formulation to avoid the numerical underflow problem. Let us use the following notation for the maximum occurrence probability of interval <q_j, τ> at time t, maximized over the intervals before time t − τ:

δ_t(q_j, τ) ≜ max_{I_1, ..., I_{t-τ}} P(I_t = <q_j, τ>, y_1^t).

Supposing that the algorithm has found the most likely interval sequences up to the previous interval, we can estimate δ_t(q_j, τ) by the following dynamic programming, similar to the Viterbi algorithm of HMMs:

δ_t(q_j, τ) = max_{<q_i, τ_p>} { h̄^(ij)(τ, τ_p) A_ij δ_{t-τ}(q_i, τ_p) } d^(j)_{[t-τ+1, t]}.   (2)

We can then take the logarithm of each term, because the summation in the forward algorithm has been changed to maximization in the formulation above.

The algorithm requires searching all the possible previous intervals for all the current intervals at every sampled time t. Therefore, the calculation cost becomes O((NL)²T) (L = l_max − l_min + 1), which is greater than that of HMMs, which require only O(N²T). However, the cost can be reduced drastically when the range L is small.

After the recursive search has ended at time T, we can backtrack to find the optimal interval sequence, the one that provides the highest probability for the input among all generable interval sequences. The maximum likelihood with respect to the given observation sequence, together with the optimal interval sequence, is thus obtained via max_{<q_j, τ>} δ_T(q_j, τ).
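The recursion of Eq. (2) and the backtracking can be sketched as follows. This is an illustrative log-domain implementation, not the authors' code: log_hbar, log_A, log_d, and log_init are stand-ins for log h̄^(ij), log A_ij, the dynamical-system log-likelihood d^(j)_{[s,t]}, and the score of an interval that opens the sequence, all supplied as callbacks.

```python
import numpy as np

def interval_viterbi(T, N, l_min, l_max, log_hbar, log_A, log_d, log_init):
    """Sketch of the interval-based Viterbi recursion (Eq. (2)) in the
    log domain.  log_hbar[i][j] is an (L, L) table of log hbar^(ij)(tau, tau_p);
    log_A is the (N, N) log transition matrix; log_d(j, s, e) returns the
    log-likelihood of dynamical system j over y[s..e] (1-indexed, inclusive);
    log_init(j, tau) scores an initial interval <q_j, tau>.
    Returns the optimal interval sequence as (state, start, end) triples."""
    NEG = -np.inf
    L = l_max - l_min + 1
    delta = np.full((T + 1, N, L), NEG)
    back = {}
    for t in range(1, T + 1):
        for j in range(N):
            for tau in range(l_min, min(l_max, t) + 1):
                if tau == t:                       # interval opens the sequence
                    score, arg = log_init(j, tau), None
                else:
                    score, arg = NEG, None
                    for i in range(N):
                        if i == j:                 # no self loops
                            continue
                        for tp in range(l_min, l_max + 1):
                            prev = delta[t - tau, i, tp - l_min]
                            if prev == NEG:
                                continue
                            s = (prev + log_A[i, j]
                                 + log_hbar[i][j][tau - l_min, tp - l_min])
                            if s > score:
                                score, arg = s, (i, tp)
                    if score == NEG:
                        continue
                delta[t, j, tau - l_min] = score + log_d(j, t - tau + 1, t)
                back[(t, j, tau)] = arg
    # backtrack from the best interval ending at time T
    j, tau = max(((j, tau) for j in range(N)
                  for tau in range(l_min, l_max + 1)),
                 key=lambda jt: delta[T, jt[0], jt[1] - l_min])
    t, seq = T, []
    while True:
        seq.append((j, t - tau + 1, t))
        arg = back[(t, j, tau)]
        if arg is None:
            break
        t, (j, tau) = t - tau, arg
    return seq[::-1]
```

The triple loop over t, q_j, and τ, with the inner maximization over <q_i, τ_p>, makes the O((NL)²T) cost mentioned above directly visible.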

4. Verification of the Inference Algorithms

In this section, we verify the capability of the interval system and of the inference algorithms (i.e., the forward and Viterbi algorithms) described in the previous section. First, the parameters of an interval system were given manually, and interval sequences were generated by the system as test data. Then, the data were used as input to the forward and Viterbi algorithms.

To concentrate on the interval-based state transition in the interval system, we set the observation matrix H = I (unit matrix) and the observation noise covariance R = O (zero matrix). The number of discrete states was N = 3 and the dimensionality of the continuous state space was n = 3. The range of the interval duration was [l_min = 1, l_max = 30]. We set the discrete state transition probabilities as A12 = A23 = A31 = 1 and all other elements to 0, so as to generate loops such as q1 → q2 → q3 → q1. The initial distribution of the discrete state was 1 for q1 and 0 for the remaining states q2 and q3.

The two-dimensional duration distribution h^(ij)(l_k, l_{k-1}) had {mean h_m^(i), variance h_v^(i)} of {6, 5}, {12, 30}, and {16, 50} for q1, q2, and q3, respectively. The covariance h_c^(ij) between the pairs {q1, q2}, {q2, q3}, and {q3, q1} was 12, 35, and 15, respectively. These covariances were designed to generate interval sequences in which the duration lengths of the intervals increase monotonically over the sequence q1, q2, q3.

The transition matrices, bias vectors, and initial mean vectors of the dynamical systems were as follows:

F^(1) = [ 0.6  −0.1 ; −0.1  0.2 ],  g^(1) = [ 0.2 ; 0.8 ],  x_init^(1) = [ 1.0 ; −1.0 ],
F^(2) = [ 0.3  0.0 ; 0.0  0.6 ],   g^(2) = [ −0.7 ; 0.0 ],  x_init^(2) = [ 0.3 ; 1.0 ],
F^(3) = [ 0.5  0.1 ; −0.1  0.3 ],  g^(3) = [ 0.6 ; −0.6 ],  x_init^(3) = [ −1.0 ; 0.0 ].

Fig. 4  Verification of the forward and Viterbi algorithms. (a) An interval sequence generated by the finite state automaton. (b) An observation sequence generated by the three dynamical systems using the interval sequence of (a); the solid line denotes the first element, the dashed line the second. (c) The result of the forward inference using (b) as input; each line represents the probability P(s_t = q_j | y_1^t) (j = 1, 2, 3). (d) The backtracked interval sequence after the Viterbi inference using (b) as input; the original interval sequence is recovered correctly.

The process noise covariance matrices Q^(i) (i = 1, 2, 3) and the covariance matrices V^(i) (i = 1, 2, 3) of the initial state distribution were set to zero in the generation step, and to 0.001 I (where I is a unit matrix) in the interval fitting step.

Figure 4 (a) shows an interval sequence generated by the finite state automaton. The length of the sequence was T = 100. The duration of the intervals increases monotonically over the sequence q1, q2, q3, because we set positive correlations between adjacent intervals (i.e., q1 to q2 and q2 to q3). Each dynamical system was activated by the label of its interval. Figure 4 (b) shows a generated observation sequence, which in this experiment coincides with the continuous state sequence (a consequence of the observation parameters H = I and R = O). The time-varying pattern of the observation changes with the transition of the intervals.

To verify the forward and Viterbi inference algorithms, we input the generated sequence shown in Fig. 4 (b) to the original interval system (i.e., the system that generated the input sequence). Figure 4 (c) shows the result of the forward inference. Each line denotes the probability of a discrete state conditioned on the observations from time 1 to t:

P(s_t = q_j | y_1^t) = Σ_τ P(I_t = <q_j, τ>, y_1^t) / P(y_1^t).

Fig. 5  A generated interval sequence using a zero-covariance duration distribution.

Figure 4 (d) shows the backtracked intervals after the Viterbi algorithm. Both the forward and Viterbi inferences found optimal interval sequences that are consistent with the original interval sequence in Fig. 4 (a).

To see the influence of the parameters of the duration distributions, we also generated interval sequences with zero covariance between adjacent intervals. Figure 5 shows an example of such a sequence. The first state sequence q1, q2, q3 exhibits non-monotonic changes of duration (i.e., the q3 interval is shorter than the q2 interval, while q2 is longer than q1), which reflects that the duration of each discrete state is decided independently. Consequently, modeling the correlation between adjacent intervals with covariances is necessary to represent rhythms and tempo as patterns of duration lengths.

5. Multiphase Learning for the Interval System

The goal of the interval system identification is to estimate the number of dynamical systems N and the parameter set Θ = {θ_i, A_ij, π_i, h_m^(i), h_v^(i), h_c^(ij) | i, j = 1, ..., N, i ≠ j}, where θ_i = {F^(i), g^(i), Q^(i), x_init^(i), V_init^(i)} denotes the parameter set of dynamical system D_i. In this paper, we assume that the observation matrix H and the noise covariance matrix R are given. If these matrices are unknown, subspace system identification techniques such as [21] can be applied by assuming a single transition matrix. Although subspace identification is sensitive to noise, it successfully identifies an observation matrix, especially from visual feature sequences [22].

As described in Subsection 2.4, the parameter estimation of the interval system requires solving two problems simultaneously: initialization of the system parameters and estimation of the number of linear dynamical systems. To overcome these problems, we propose a multiphase learning method, described in the following paragraphs (see also Fig. 6), to estimate the parameters of the interval system. The key idea is to divide the estimation process into two steps: a clustering process of dynamical systems using a typical training data set (i.e., a subset of the given training data), and a refinement process for all the parameters using all the training data.

Fig. 6  The multiphase learning scheme for the interval system: hierarchical clustering of dynamical systems, applied to a typical data set, yields the number of dynamical systems and their parameters; these are fixed to initialize the refinement process via the EM algorithm, which re-estimates the parameters of the dynamical systems and of the automaton using the whole training data set.

Clustering of Dynamical Systems: The first step is a clustering process that finds a set of dynamical systems: the number of the systems and their parameters. This step requires that a set of typical sequences is given as training data and that the sequences have already been mapped to the continuous state space. For this step, we propose an agglomerative hierarchical clustering technique that iteratively merges dynamical systems.

Refinement of the Parameters: The second step is a refinement process for the system parameters based on the EM algorithm. Because the EM algorithm strongly depends on its initial parameters, the preceding clustering step provides an initial parameter set that is relatively close to the optimum compared with a randomly selected one.

5.1 Hierarchical Clustering of Dynamical Systems

Let us assume that a multivariate sequence y_1^T = y_1, ..., y_T is given as typical training data (we consider a single training sequence without loss of generality). We then simultaneously estimate a set of linear dynamical systems D (i.e., the number of linear dynamical systems N and their parameters θ_1, ..., θ_N) together with an interval set I (i.e., a segmentation and labeling of the sequence) from the training sample y_1^T. Note that the number of intervals K is also unknown. We formulate the problem as a search for the linear dynamical system set D and the interval set I that maximize the overall likelihood with respect to the training data: L = P(y_1^T | I, D). Because the likelihood increases monotonically with the number of dynamical systems, we need to find the right balance between the likelihood and the number N. A hierarchical clustering approach provides an interface, such as the history of model fitting errors at each merging step, for deciding the number of dynamical systems (see Subsection 5.1.4 for details).

Algorithm 1 shows the details of the clustering process. This algorithm requires an initial partitioning of the training sequence, which we refer to as the initial interval set I_init; we discuss this initialization in Subsection 5.1.1.

Fig. 7  Hierarchical clustering of dynamical systems.

Algorithm 1  Agglomerative Hierarchical Clustering.
  for I_i in I_init do
    D_i ← Identify(I_i)
  end for
  for all pairs (D_i, D_j) where D_i, D_j ∈ D do
    Dist(i, j) ← CalcDistance(D_i, D_j)
  end for
  while N ≥ 2 do
    (i*, j*) ← arg min_{(i, j)} Dist(i, j)
    I_{i*} ← MergeIntervals(I_{i*}, I_{j*})
    D_{i*} ← Identify(I_{i*})
    erase D_{j*} from D
    N ← N − 1
    for all pairs (D_{i*}, D_j) where D_j ∈ D do
      Dist(i*, j) ← CalcDistance(D_{i*}, D_j)
    end for
  end while

In the first step of the algorithm, dynamical systems are identified (denoted by Identify in Algorithm 1) from each interval in the initial interval set I_init individually, based on the identification method proposed in Subsection 5.1.2. Here, I_i is the interval set that comprises the intervals labeled by D_i. Then, we calculate the distances (denoted by CalcDistance) between all pairs of dynamical systems, based on the distance definition described in Subsection 5.1.3. In the following steps, the nearest pair of dynamical systems is merged iteratively; the two interval sets that belong to the nearest pair are also merged simultaneously (denoted by MergeIntervals). As the iteration proceeds, all the dynamical systems are eventually merged into a single dynamical system. Figure 7 shows an example of the step in which dynamical systems D2 and D4 are merged.

5.1.1 Initialization of the Clustering Algorithm

Because the clustering algorithm iteratively merges intervals in the initial interval set I_init, the estimation of I_init is an important problem. In this paper, we set several conditions to determine I_init. First, we require the number of intervals |I_init| to be relatively small compared to the length of the training sequence, because the calculation cost depends on |I_init|; the simplest partitioning approach, dividing the training sequence into fixed-length small intervals, is therefore undesirable. Second, we require that the initial intervals separate stationary poses from motion in the training sequence. Third, we require the intervals I_i ∈ I_init to be simple, in the sense that the patterns in the initial intervals appear frequently in the training sequence.

To satisfy the conditions above, we exploit the monotonicity of motion. Many of the motions observed in man-machine interaction, especially in visual feature sequences, can be divided into monotonic patterns, because humans move their bodies using combinations of intermittent muscle controls. We therefore divide the training sequence at the zero-crossing points of the first-order difference of the feature vectors. Moreover, we assign a single interval to each temporal range in which the magnitude of the difference remains smaller than a given threshold.
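The initialization rule above might be sketched as follows for a one-dimensional feature; the threshold name eps and the run-length encoding are our own choices. Motion steps are labeled by the sign of the first-order difference, near-zero steps are labeled stationary, and each maximal run of equal labels becomes one initial interval.

```python
import numpy as np

def initial_intervals(y, eps):
    """Partition a 1-D feature sequence at zero crossings of its first-order
    difference; steps with |difference| < eps are labeled stationary, and
    each maximal run of equal labels becomes one interval (start, end),
    adjacent intervals sharing their boundary frame."""
    d = np.diff(y)
    lab = np.where(np.abs(d) < eps, 0, np.sign(d)).astype(int)
    segs, start = [], 0
    for t in range(1, len(lab)):
        if lab[t] != lab[t - 1]:       # regime change at frame t
            segs.append((start, t))
            start = t
    segs.append((start, len(y) - 1))
    return segs
```

For a multivariate feature sequence, the same rule could be applied to a scalar summary (e.g., the norm of the difference vectors), though the paper does not specify this detail.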

5.1.2 Constrained System Identification

To identify the system parameters from only a small amount of training data, we need constraints for estimating appropriate dynamics. In this paper, we concentrate on extracting human motion primitives observed in, for instance, facial motion, gaits, and gestures; therefore, constraints based on the stability of the dynamics are suitable for finding motion that converges to a certain state from an initial pose. The key idea for estimating stable dynamics is to constrain the eigenvalues: if all the eigenvalues have magnitude less than 1, the dynamical system changes its state in a stable manner.

Given a continuous state sequence mapped from the observation space, estimating the transition matrix F^(i) from the sequence of continuous state vectors x_b^(i), ..., x_e^(i) becomes a minimization problem of prediction errors. Let X_0^(i) = [x_b^(i), ..., x_{e-1}^(i)] and X_1^(i) = [x_{b+1}^(i), ..., x_e^(i)], where the temporal range [b, e] is represented by linear dynamical system D_i. Then, we can estimate the transition matrix F^(i) by

F^(i)* = arg min_{F^(i)} ||F^(i) X_0^(i) − X_1^(i)||²
       = lim_{δ²→0} X_1^(i) X_0^(i)⊤ (X_0^(i) X_0^(i)⊤ + δ²I)^{-1},   (3)

where I is the unit matrix and δ is a positive real value.

For the constraint on the eigenvalues, we stop the limit in Eq. (3) before X_0^(i)⊤ (X_0^(i) X_0^(i)⊤ + δ²I)^{-1} converges to a pseudo-inverse matrix of X_0^(i). Using Gershgorin's theorem from linear algebra, we can determine an upper bound of the eigenvalues of a matrix from its elements. Suppose f_uv^(i) is the element in row u and column v of the transition matrix F^(i). Then, an upper bound of the eigenvalues is given by B = max_u Σ_{v=1}^n |f_uv^(i)|. Therefore, we search for a nonzero value of δ, which controls the scale of the elements of the matrix, that satisfies B = 1 via an iterative numerical calculation.

5.1.3 Distance Definition between Dynamical Systems

In order to determine a pseudo-distance between two linear dynamical systems, we compare the following three approaches: (a) direct comparison of model parameters, (b) the decrease in likelihood of the overall system after merging two models [23], and (c) a distance measure between distributions (e.g., KL divergence [24]).

Approach (a) is undesirable, particularly for our bottom-up approach, because a linear dynamical system has a large parameter set that often causes overfitting. Approach (b), which is often used with stepwise-optimal clustering [20], performs well under ideal conditions, but it requires calculating likelihood scores for all combinations of linear dynamical system pairs. In our experience, approach (c) provides a stable dissimilarity measure for our algorithm at a realistic calculation cost. We therefore define the distance between linear dynamical systems D_i and D_j as the average of two asymmetric divergences:

Dist(D_i, D_j) = {KL(D_i || D_j) + KL(D_j || D_i)} / 2,

where each divergence is calculated as an approximation of the KL divergence:

KL(D_i || D_j) ≈ (1/C_i) Σ_{I_k ∈ I_i} { log P(y_{b_k}^{e_k} | D_i) − log P(y_{b_k}^{e_k} | D_j) },

where y_{b_k}, ..., y_{e_k} is the observation subsequence partitioned by interval I_k, and C_i is the total duration of the intervals in the interval set I_i labeled by linear dynamical system D_i. Note that the likelihoods can be calculated based on Eq. (1).
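The symmetrized divergence can be sketched once a per-interval log-likelihood is available. In the snippet below (ours, not the authors' code), loglik(D, interval) stands for log P(y_{b_k}^{e_k} | D) computed via Eq. (1), and both it and length are callbacks, so the sketch is model-agnostic.

```python
def lds_distance(Di, Dj, intervals_i, intervals_j, loglik, length):
    """Symmetrized, likelihood-based distance between two systems.
    loglik(D, interval) -> log P(segment | D); intervals_i / intervals_j
    are the interval sets labeled by Di / Dj; length(interval) gives an
    interval's duration (so the normalizer is C_i, the total duration)."""
    def kl(a, b, ivs):
        c = sum(length(iv) for iv in ivs)
        return sum(loglik(a, iv) - loglik(b, iv) for iv in ivs) / c
    return 0.5 * (kl(Di, Dj, intervals_i) + kl(Dj, Di, intervals_j))
```

With unit-variance Gaussian "systems" (a model reduced to its mean, loglik a negative sum of squared residuals), the distance grows with the separation of the means, as expected of a KL-style divergence.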

5.1.4 The Cluster Validation Problem

Determining the appropriate number of dynamical systems is an important problem in real applications. It is often referred to as the cluster validation problem, which remains essentially unsolved. There are, however, several well-known criteria, which can be categorized into two types, for deciding the number of clusters. One type is based on the change of model fitting scores, such as log-likelihood scores or prediction errors (approximations of the log-likelihood scores), during the merging steps: if the score decreases rapidly, the merging process is stopped [25]; in other words, the method finds the knee of the log-likelihood curve. The other type is based on information theory, such as the minimum description length and Akaike's information criterion. The information-theoretical criteria define evaluation functions that consist of two terms: a log-likelihood score and the number of free parameters.

Although information-theoretical criteria work well for simple models, they tend to fail to evaluate the right balance between the two terms, especially when the model becomes complex and has a large number of free parameters [26]. Because this problem also arises in our case, we use the model fitting scores directly. First, we extract candidates for the number of dynamical systems by finding peaks in the difference of model fitting errors between the current and previous merging steps: if the difference exceeds a predefined threshold, the number of dynamical systems at that step is added to the candidates. We consider that the user should then decide the appropriate number of dynamical systems from the extracted candidates.

5.2 Refinement of Parameters via the EM algorithm

This subsection describes a refinement process for the overall system parameter set Θ, including the parameters of the automaton, that exploits the result of the clustering process. In this process, the number of dynamical systems N is fixed.

In order to refine the parameters, we apply an iterative estimation based on the EM algorithm. The algorithm starts from a given initial parameter set Θ_init and repeats two steps: the E (expectation) step and the M (maximization) step. The E step fits the model to the given training samples based on the current model parameters, and calculates the expectation of the log-likelihood with respect to the given samples and all the possible fitting instances. The M step updates the parameters to maximize this expected log-likelihood using the statistics computed in the E step. Through these iterations, the algorithm converges to a (locally) optimal parameter set and simultaneously yields the result of the model fitting.

Although the EM algorithm has been proved to converge, the obtained solutions are often trapped in local optima. This becomes a critical problem especially as the number of parameters increases. For models with many parameters, the algorithm therefore requires an initial parameter set that is relatively close to the optimum.

The multiphase learning method proposed in this paper overcomes the initialization problem by exploiting the result of the clustering process applied to a typical training data set. From the viewpoint of likelihood maximization, the clustering process returns approximately estimated parameters for the dynamical systems. To initialize the remaining parameters, such as the interval transition matrix [A_ij] and the interval duration distributions h^(ij)(l_k, l_{k-1}), we exploit the Viterbi algorithm introduced in Subsection 3.4 with a modification: we set all the transition probabilities to be equal, A_ij = 1/(N−1), and we assume the interval durations h^(ij)(l_k, l_{k-1}) to be uniformly distributed over the range [l_min, l_max], where i, j = 1, ..., N (i ≠ j). As a result, Eq. (2) in the Viterbi algorithm becomes

δ_t(q_j, τ) = max_{<q_i, τ_p>} { δ_{t-τ}(q_i, τ_p) } d^(j)_{[t-τ+1, t]}.

After the initialization above, parameter refinement iterations are applied to all the training data Y ≜ {y_1^{T_1}, ..., y_1^{T_M}}, where M is the number of training sequences and T_1, ..., T_M are the lengths of the sequences. In this paper, we approximate the EM algorithm, because the original EM algorithm requires forward/backward inferences that often cause numerical underflow. Instead of calculating the statistics over all possible hidden variables (possible interval sequences), we use only the single interval sequence that gives the highest log-likelihood in each step. The total refinement process is thus described as follows:

1. Initialize Θ = Θ_init.
2. Repeat the following steps while the total likelihood changes by more than a given threshold ε:

   E step: Search for an optimal interval sequence, labeled by the discrete states, for each of the training sequences based on the current parameter set Θ:

   I* = arg max_I log P(Y, I | Θ),

   where I* is the set of searched interval sequences for all the training sequences.

   M step: Update the parameters based on the interval sequences searched in the E step:

   Θ = arg max_{Θ'} log P(Y, I* | Θ').

This algorithm is easily extended to use multiple interval sequences that give relatively high likelihood scores, rather than only the single optimal interval sequence.
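The approximate (hard) EM loop above has the following shape. In this sketch (ours, not the authors' code), viterbi_fit (returning an interval sequence and its log-likelihood) and update_params stand in for the E-step search and the M-step re-estimation, and all names are our own.

```python
def refine_parameters(theta_init, sequences, viterbi_fit, update_params,
                      eps=1e-4, max_iter=100):
    """Hard-EM refinement: alternately (E) search the single best interval
    sequence per training sequence under the current parameters, and
    (M) re-estimate the parameters from those interval sequences.  Stops
    when the total log-likelihood improves by less than eps."""
    theta, prev_ll = theta_init, -float("inf")
    for _ in range(max_iter):
        fits = [viterbi_fit(theta, y) for y in sequences]        # E step
        ll = sum(score for _, score in fits)
        if ll - prev_ll < eps:
            break
        theta = update_params(theta, sequences,
                              [ivs for ivs, _ in fits])          # M step
        prev_ll = ll
    return theta, prev_ll
```

Because only the single best interval sequence is kept per iteration, each E step is exactly the Viterbi search of Subsection 3.4, and the extension mentioned above (keeping several high-likelihood sequences) would only change what viterbi_fit returns.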

6. Experimental Results

6.1 Evaluation on Simulated Data

To evaluate the proposed parameter estimation method, we first used simulated sequences as training data, because they provide the ground truth for the estimated parameters. Multivariate sequences were generated from a system that had the same parameters as the system in Section 4, except that the state transition probability A31 was set to zero to generate non-cyclic sequences. In these sequences, three dynamic primitives appeared in the temporal order D1 → D2 → D3, and only the duration lengths varied. Figure 8 (a) and (b) show an example of the generated interval sequence and the two-dimensional observation sequence, respectively.

The proposed clustering process was evaluated on the sequence in Fig. 8 (b). First, the initial interval set was determined by the zero-crossing points of the first-order difference, as described in Subsection 5.1.1. Then, the initial dynamical systems were identified from this interval set; the number of initial dynamical systems was N = 6. Figure 8 (c) shows the interval sequences obtained during the clustering process. The number of dynamical systems was reduced from N = 6 to N = 2 by iteratively merging the nearest dynamical system pairs. In each iteration step, the two interval sets belonging to the nearest two dynamical systems were also merged.

For the evaluation of the refinement process, we used the dynamical systems extracted in the clustering process. We manually selected N = 3 as the number of dynamical systems, which corresponds to the number in the original system that generated the typical sequence. Nine additional sequences were generated as training data, and all the generated sequences, including the typical sequence, were used as input to the EM algorithm. The algorithm was initialized with the parameters found in the clustering process. Figure 9 (a) shows the interval sequences searched in the E step of each iteration. The partitioning of the intervals gradually converges to almost the same partitions as the original interval sequence shown in Fig. 8 (a). The solid line in Fig. 9 (b) shows the change of the overall log-likelihood of the system; the algorithm has almost converged by the ninth iteration. The dashed line in Fig. 9 (b) shows the change of the overall log-likelihood when the duration distribution h^(ij) was set to be uniform and the adjacent intervals were modeled as uncorrelated; in this case, the algorithm converges to a local optimum.

The results of the clustering process can become inaccurate if inappropriate sequences are selected as the typical training data. Despite such inaccurate parameter estimation in the clustering process, the evaluation shows that the refinement process applied to all the training data recovers from the initial error. In particular, meta-level features, such as the modeling of interval duration lengths and the relations between intervals, seem to act as constraints on the partitioning process in the EM algorithm.

6.2 Evaluation on Real Data

To evaluate the capability of the proposed method in real applications, we applied the clustering method to captured video data.

Fig. 8  The result of the clustering process applied to the typical training data shown in (b). (a) The interval sequence generated for the typical training data. (b) The two-dimensional observation sequence selected as the typical data (solid: the first element; dashed: the second element). (c) The interval sequences partitioned by the clustered dynamical systems. The process was initialized with N = 6 and reduced the number of dynamical systems (discrete states) by merging the nearest dynamical system pair in each iteration step.

A frontal facial image sequence was captured by a 60 fps camera while a subject smiled four times. Facial feature points were tracked with the active appearance model [27], [28], and eight feature points around the right eye were extracted. The length of the sequence was 1000. We then applied the clustering method to the obtained 16-dimensional vector sequence comprising the x- and y-coordinates of the feature points (both coordinates are plotted together in Fig. 10 (a)). Figure 10 (b) shows the overall model fitting errors between the original sequence Y and the sequences Y^gen(N) generated by the extracted N dynamical systems. The error at each merge iteration step was calculated by the Euclidean norm

Err(N) = ||Y − Y^gen(N)|| = √( Σ_{t=1}^{1000} ||y_t − y_t^gen(N)||² ),

which approximates the overall log-likelihood score. The candidate numbers of dynamical systems were determined to be N = 3 and N = 6 by extracting the steps in which the difference Err(N−1) − Err(N) exceeded a given threshold (0.01 in this experiment). The difference Err(N−1) − Err(N) is shown in Fig. 10 (c).

[Figure 9: (a) searched interval sequence in each E step during the iteration (it = 1, ..., 9); (b) the change of the overall log-likelihood during the EM iteration, with and without the duration model.]

Fig. 9  The result of the refinement process via the EM algorithm using all the training data.

Figure 10 (d) shows the intervals extracted from the clustering process when the number of dynamical systems was N = 6. Figure 10 (e) shows the sequence generated from the extracted six dynamical systems, where each dynamical system was activated based on the partitioned intervals in Fig. 10 (d). The dominant dynamical systems D3 and D4 correspond to the intervals in which the eye remained closed and open, respectively. The other dynamical systems, such as D5, correspond to the eye blink motion.
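Generating a sequence from the clustered systems, as in Fig. 10 (e), amounts to running each linear dynamical system over its assigned interval. The matrices and interval list below are illustrative placeholders, and the process noise is omitted for clarity.

```python
import numpy as np

def generate_from_intervals(intervals, A, C, x0):
    """Generate an observation sequence by activating one linear
    dynamical system per interval: x_{t+1} = A[q] x_t, y_t = C[q] x_t.
    `intervals` is a list of (discrete state, duration) pairs."""
    x, ys = x0, []
    for q, dur in intervals:
        for _ in range(dur):
            ys.append(C[q] @ x)
            x = A[q] @ x
    return np.array(ys)

# Illustrative two-system example: a decaying dynamics and a rotating one.
A = {0: 0.9 * np.eye(2),
     1: np.array([[np.cos(0.3), -np.sin(0.3)],
                  [np.sin(0.3),  np.cos(0.3)]])}
C = {0: np.eye(2), 1: np.eye(2)}
Y = generate_from_intervals([(0, 10), (1, 20), (0, 5)], A, C, np.ones(2))
print(Y.shape)  # -> (35, 2)
```

The continuous state x is carried across interval boundaries here; as noted in the future-work discussion below, modeling the transitional process at those boundaries is itself an open problem.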

Consequently, the history of model fitting errors during the clustering process helps us to decide the appropriate number of dynamical systems.

7. Conclusion

This paper proposed a multiphase learning method for the interval-based hybrid dynamical system, which comprises a finite state automaton as an event-based model and multiple linear dynamical systems as clock-based models. The experimental results on simulated and real data show that the proposed parameter estimation method successfully finds the set of dynamical systems embedded in the training data and the transition probabilities between the dynamics, together with a model of adjacent interval duration lengths.

[Figure 10: (a) tracked feature points around the right eye; (b) errors Err(N) between the original and generated sequences; (c) differences Err(N−1) − Err(N) between the current and previous errors (threshold = 0.01); (d) intervals partitioned by the dynamical systems (N = 6); (e) sequences generated from the clustered dynamical systems (N = 6).]

Fig. 10  Clustering results of the feature sequence around the right eye while a subject was smiling four times.

For future work, we need to discuss the open problems of interval systems, including the following areas.

(1) Model selection problems:

The selection of specific clock-based models depends on the nature of the features and the design policies of users. We chose linear dynamical systems as clock-based models because they are appropriate for representing continuously changing human motion. HMMs can be a more reasonable choice for modeling non-linear time-varying patterns such as consonants in human speech. Moreover, some motion generation studies design the total system as a clock-based system, such as non-linear dynamical systems [29]. We need further discussion to give a guideline for selecting models.

(2) Modeling structures in multiple streams:

Modeling relations among concurrent multiple streams (e.g., temporal relations of motions among facial parts) and relations among multimodal data (e.g., lip motion and speech data) is one of the important objectives for the interval-based representation. A general method for modeling relations among intervals will be required to represent concurrent events.

(3) Modeling transitional process between dynamicalsystems:

The state in the continuous state space often changes discontinuously when the automaton switches the dynamical systems. To model coarticulated dynamics such as phonemes in speech data, we need to model the transitional process between two dynamical systems (e.g., the modulation of dynamics by the preceding dynamics).
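One plausible formulation of such a transitional process, offered here purely as a hypothetical sketch and not as the paper's proposal, is to interpolate between the preceding and following transition matrices over a short window, so that the continuous state changes smoothly instead of jumping at the switch.

```python
import numpy as np

def blended_transition(A_prev, A_next, x, window):
    """Hypothetical transitional-process model: over `window` steps,
    the effective dynamics matrix blends linearly from A_prev to
    A_next, modulating the new dynamics by the preceding one."""
    xs = []
    for t in range(window):
        w = (t + 1) / window                  # blending weight, 0 -> 1
        A_t = (1 - w) * A_prev + w * A_next   # interpolated dynamics
        x = A_t @ x
        xs.append(x)
    return np.array(xs)

# Example: blend from a strongly damped dynamics to the identity.
xs = blended_transition(0.5 * np.eye(2), np.eye(2), np.ones(2), 4)
```

Other choices (e.g., blending the generated observations rather than the matrices) are equally possible; the point is only that the switch needs an explicit model.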

(4) Grammatical structures of dynamics:

We exploited a simple finite state automaton as the event-based model in order to concentrate on modeling human body actions and motions, which have relatively simple grammatical structures compared to natural languages. To model languages such as human speech and signs, more complex grammatical structures such as N-gram models and context-free grammars will be required, together with corresponding parameter estimation methods.

Acknowledgments

This work was supported in part by a Grant-in-Aid for Scientific Research of the Ministry of Education, Culture, Sports, Science and Technology of Japan under contracts 13224051 and 16700175.

References

[1] S. Nakaoka, A. Nakazawa, K. Yokoi, and K. Ikeuchi, “Leg motion primitives for a dancing humanoid robot,” Proc. IEEE Intl. Conference on Robotics and Automation, pp.610–615, 2004.

[2] C. Bregler, “Learning and recognizing human dynamics in video sequences,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp.568–574, 1997.

[3] A.V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, “Dynamic Bayesian networks for audio-visual speech recognition,” EURASIP Journal on Applied Signal Processing, vol.2002, no.11, pp.1–15, 2002.

[4] Y. Li, T. Wang, and H.Y. Shum, “Motion texture: A two-level statistical model for character motion synthesis,” SIGGRAPH, pp.465–472, 2002.

[5] M. Kamachi, V. Bruce, S. Mukaida, J. Gyoba, S. Yoshikawa, and S. Akamatsu, “Dynamic properties influence the perception of facial expressions,” Perception, vol.30, pp.875–887, 2001.

[6] B.D.O. Anderson and J.B. Moore, Optimal Filtering, Prentice-Hall, 1979.

[7] J.F. Allen, “Maintaining knowledge about temporal intervals,” Commun. ACM, vol.26, no.11, pp.832–843, 1983.

[8] X.D. Huang, Y. Ariki, and M.A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh Univ. Press, 1990.

[9] Z. Ghahramani and G.E. Hinton, “Switching state-space models,” Technical Report CRG-TR-96-3, Dept. of Computer Science, University of Toronto, 1996.

[10] V. Pavlovic, J.M. Rehg, T. Cham, and K.P. Murphy, “A dynamic Bayesian network approach to figure tracking using learned dynamic models,” Proc. IEEE Intl. Conference on Computer Vision, pp.94–101, Sept. 1999.

[11] V. Pavlovic, J.M. Rehg, and J. MacCormick, “Learning switching linear models of human motion,” Proc. Neural Information Processing Systems, 2000.

[12] H. Balakrishnan, I. Hwang, J.S. Jang, and C.J. Tomlin, “Inference methods for autonomous stochastic linear hybrid systems,” Lecture Notes in Computer Science 2993, Hybrid Systems: Computation and Control, pp.64–79, 2004.

[13] M. Ostendorf, V. Digalakis, and O.A. Kimball, “From HMMs to segment models: A unified view of stochastic modeling for speech recognition,” IEEE Trans. Speech Audio Process., vol.4, no.5, pp.360–378, 1996.

[14] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, “A clustering technique for the identification of piecewise affine systems,” Automatica, vol.39, pp.205–217, 2003.

[15] J. Roll, A. Bemporad, and L. Ljung, “Identification of piecewise affine systems via mixed-integer programming,” Automatica, vol.40, pp.37–50, 2004.

[16] J.H. Kim, S. Hayakawa, T. Suzuki, K. Hayashi, S. Okuma, N. Tsuchida, M. Shimizu, and S. Kido, “Modeling of driver's collision avoidance behavior based on piecewise linear model,” Proc. IEEE Intl. Conference on Decision and Control, vol.3, pp.2310–2315, 2004.

[17] M. Nishiyama, H. Kawashima, and T. Matsuyama, “Facial expression recognition based on timing structures in faces,” Computer Vision and Image Media Technical Report (IPSJ SIG-CVIM 149), vol.2005, no.38, pp.179–186, 2005.

[18] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol.39, pp.1–38, 1977.

[19] N. Yamada, T. Suzuki, S. Inagaki, and S. Sekikawa, “On the parameter estimation of stochastic switched ARX models and its application,” Proc. the 18th Workshop on Circuits and Systems in Karuizawa, pp.269–274, 2005.

[20] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.

[21] P. Van Overschee and B. De Moor, “A unifying theorem for three subspace system identification algorithms,” Automatica, vol.31, no.12, pp.1853–1864, 1995.

[22] G. Doretto, A. Chiuso, Y.N. Wu, and S. Soatto, “Dynamic textures,” Intl. J. Comput. Vis., vol.51, no.2, pp.91–109, 2003.

[23] T. Brants, “Estimating HMM topologies,” Logic and Computation, 1995.

[24] B.H. Juang and L.R. Rabiner, “A probabilistic distance measure for hidden Markov models,” AT&T Technical Journal, vol.64, no.2, pp.391–408, 1985.

[25] D.K. Panjwani and G. Healey, “Markov random field models for unsupervised segmentation of textured color images,” IEEE Trans. Pattern Anal. Mach. Intell., vol.17, no.10, pp.939–954, 1995.

[26] D.A. Langan, J.W. Modestino, and J. Zhang, “Cluster validation for unsupervised stochastic model-based image segmentation,” IEEE Trans. Image Process., vol.7, no.2, pp.180–195, 1998.

[27] T.F. Cootes, G.J. Edwards, and C.J. Taylor, “Active appearance models,” Proc. European Conference on Computer Vision, vol.2, pp.484–498, 1998.

[28] M.B. Stegmann, B.K. Ersboll, and R. Larsen, “FAME - a flexible appearance modelling environment,” Informatics and Mathematical Modelling, Technical University of Denmark, 2003.

[29] M. Okada, K. Tatani, and Y. Nakamura, “Polynomial design of the nonlinear dynamics for the brain-like information processing of whole body motion,” Proc. IEEE Intl. Conference on Robotics and Automation, 2002.

Hiroaki Kawashima received the B.Eng. degree in electrical and electronic engineering and the M.S. degree in informatics from Kyoto University, Japan, in 1999 and 2001, respectively. He is currently an assistant professor in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. His research interests include time-varying pattern recognition, human-machine interaction, and hybrid dynamical systems. He is a member of the IEEE Computer Society.

Takashi Matsuyama received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Kyoto University, Japan, in 1974, 1976, and 1980, respectively. He is currently a professor in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. His research interests include knowledge-based image understanding, computer vision, cooperative distributed vision, 3D video, and human-machine interaction. He received six best paper awards from Japanese and international academic societies, including the Marr Prize at the International Conference on Computer Vision in 1995. He is a fellow of the International Association for Pattern Recognition and a member of the Information Processing Society of Japan, the Japanese Society for Artificial Intelligence, and the IEEE Computer Society.