

SNIPPETS OF SYSTEM IDENTIFICATION IN COMPUTER VISION

Stefano Soatto ∗,1   Alessandro Chiuso ∗∗,2

∗ University of California, Los Angeles – CA 90095, [email protected]

∗∗ Università di Padova – Italy 35131, [email protected]

Abstract: In this paper we illustrate the use of identification-theoretic techniques in computer vision, and hint at some open problems.

Keywords: System identification, computer vision, dynamic scene analysis, visual recognition, dynamic textures, human gaits, dynamic vision.

1. INTRODUCTION

The sense of vision plays a crucial role in the life of primates, allowing them to infer spatial properties of the environment and perform crucial tasks for survival. Primates use vision to explore unfamiliar surroundings, negotiate physical space with one another, detect and recognize a prey at a distance, fetch it, all with seemingly little effort. To this day, engineered systems are far from exhibiting similar capabilities, and endowing a computer with a "sense of vision" is proving to be a formidable task: 3 How can we take a collection of digital images (i.e. arrays of positive numbers), as shown in Table 1, and tell whether we are looking at an apple or a person, and whether she is wearing a hat?

This is no easy task, since the measured images depend upon the geometry of the scene (which is unknown), its reflectance properties (also unknown), its motion (unknown) and the illumination of the scene (unknown). While some of these

1 Supported by AFOSR F49620-03-1-0095 and ONR N00014-02-1-0720.
2 Supported by ASI and MURST.
3 Although primates appear to be able to "see" effortlessly, it is interesting to notice that about half of their cerebral cortex is devoted to processing visual information (Felleman and van Essen, 1991).

[Image omitted: a photograph, with x and y pixel axes spanning roughly 10-90 by 10-70; its pixel values are subsampled below.]

188 186 188 187 168 130 101  99 110 113 112 107 117 140 153 153
189 189 188 181 163 135 109 104 113 113 110 109 117 134 147 152
190 190 188 176 159 139 115 106 114 123 114 111 119 130 141 154
190 188 188 175 158 139 114 103 113 126 112 113 127 133 137 151
191 185 189 177 158 138 110  99 112 119 107 115 137 140 135 144
193 183 178 164 148 134 118 112 119 117 118 106 122 139 140 152
185 181 178 165 149 135 121 116 124 120 122 109 123 139 141 154
175 176 176 163 145 131 120 118 125 123 125 112 124 139 142 155
170 170 172 159 137 123 116 114 119 122 126 113 123 137 141 156
171 171 173 157 131 119 116 113 114 118 125 113 122 135 140 155
174 175 176 156 128 120 121 118 113 112 123 114 122 135 141 155
176 174 174 151 123 119 126 121 112 108 122 115 123 137 143 156
175 169 168 144 117 117 127 122 109 106 122 116 125 139 145 158
179 179 180 155 127 121 118 109 107 113 125 133 130 129 139 153
176 183 181 153 122 115 113 106 105 109 123 132 131 131 140 151
180 181 177 147 115 110 111 107 107 105 120 132 133 133 141 150
181 174 170 141 113 111 115 112 113 105 119 130 132 134 144 153
180 172 168 140 114 114 118 113 112 107 119 128 130 134 146 157
186 176 171 142 114 114 116 110 108 104 116 125 128 134 148 161
185 178 171 138 109 110 114 110 109  97 110 121 127 136 150 160

Table 1. An "image" (top) is an array of positive numbers (bottom, subsampled) whose values are influenced by many "nuisance factors" including illumination and material properties of the scene.

unknowns are not necessarily of interest, they all affect the measurements, and therefore the inference process has to be invariant with respect to these "nuisance" variables. It is easy to convince oneself that from images alone, no matter how many, it is impossible to recover a physically correct model of the geometry (shape), photometry (reflectance) and dynamics (motion) of the scene. In this sense, visual perception per se is an ill-posed problem. However, the inference problem may become well-posed within the context of a specific task. For instance, while one cannot infer "the" (physically correct) model of the scene in Table 1, one can infer a representation of the scene that can be sufficient to support, for instance, control tasks or recognition tasks. After all, even if we cannot infer the correct model of steam, or smoke, or fire (no matter what computational device we have available, be it a computer or a brain), we sure know how to recognize smoke or fire when we see it.

In this paper we seek to infer models of dynamic visual processes for the purpose of classification and recognition tasks. For instance, from a number of sequences of images of fire, smoke or steam, we want to identify a model that can be used to then recognize, say, fire in a new sequence. The same goes for human motion: how can we infer a model of various gaits, such as walk, run, jump, limp, so that we can for instance detect a limping person from afar? We will first address the simplest possible classes of models, working under the assumption that the underlying data exhibit some form of stationarity. Here the current body of knowledge in system identification has already played a key role in a number of applications, as we describe in Section 2. Even for such simple models, however, recognition and classification tasks remain largely an open problem, which we discuss in Section 3. Imposing simpler models and inferring the model parameters, as well as the domain where they are satisfied within a prescribed accuracy, leads to a segmentation problem, which is described in Section 4. Extensions to more complex models are countless, and so are their applications. We therefore highlight, in Section 5, some open problems in system identification that are motivated by extensions of the research described in this paper.

2. MODELING DYNAMIC VISUAL PROCESSES

Since a physically correct model of the geometry, photometry and dynamics of a scene cannot be recovered from visual information alone, it is customary to make assumptions on some of the unknowns in order to infer the others. For instance, in stereo reconstruction one often exploits the assumption that the scene exhibits Lambertian reflection in order to recover shape (see (Ma et al., 2003) and references therein). Naturally, if the assumption is violated, the resulting inference is meaningless, which results in visual illusions that both biological and artificial systems are subject to (Chiuso et al., 2000).

Therefore, here we take a different approach: rather than making prior assumptions and attempting to use visual information to recover a physical model of the scene, we seek to recover a statistical model of the visual data directly at the outset. Hopefully, this model will support recognition and classification tasks, which we discuss in Section 3. To this end, we represent visual data as the output of a dynamical system driven by realizations of a process drawn from an unknown distribution. Inferring a model of the scene then consists of inferring the model parameters as well as the input distribution. In the next two subsections we illustrate the simplest possible model applied first to pixel intensity, resulting in so-called "dynamic textures", and then to the position of a number of landmark "features", as a model of human gaits.

2.1 Dynamic textures

Let I(t) ∈ R^(k×l), t = 1 … τ, be a sequence of images. Suppose that at each instant of time t we can measure a noisy version of the image, y(t) = I(t) + w(t), where w(t) is an independent and identically distributed (IID) sequence drawn from a distribution pw(·), resulting in a positive measured sequence y(t) ∈ R^m, t = 1 … τ 4, where m = k × l. We say that the sequence I(t) is a (linear) dynamic texture if there exists a set of n spatial filters φα, α = 1 … n, and a stationary distribution q(·) such that, calling z(t) := φ(I(t)), z(t) can be modeled as an ARMA process excited by the white noise v(t), distributed according to q(·). Therefore, a dynamic texture is associated to (a second-order stationary process and, therefore) a state-space model with unknown input distribution

    x(t + 1) = A x(t) + B v(t)
    z(t)     = C x(t) + D v(t)                    (1)
    y(t)     = ψ(z(t)) + w(t)

with x(0) = x0, v(t) IID ∼ q(·) unknown, w(t) IID ∼ pw(·) given, and I(t) = ψ(z(t)), where ψ(φ(I)) = I. One can obviously extend the definition to an arbitrary non-linear model of the form x(t + 1) = f(x(t), v(t)), leading to the concept of a non-linear dynamic texture.
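As a concrete illustration, the generative model (1) can be simulated directly; the following sketch takes ψ to be the identity, and all parameter values (A, B, C, D, noise scales) are made up for illustration, not identified from data:

```python
import numpy as np

# Toy simulation of the dynamic-texture model (1); psi is the identity
# and all parameter values are illustrative, not identified.
rng = np.random.default_rng(0)
n, m, tau = 2, 4, 100              # state dim., output dim., sequence length

A = np.array([[0.9, -0.2],
              [0.1,  0.8]])        # stable state-transition matrix
B = 0.1 * np.eye(n)
C = rng.standard_normal((m, n))
D = 0.05 * rng.standard_normal((m, n))

x = np.zeros(n)                    # x(0) = x0
frames = []
for t in range(tau):
    v = rng.standard_normal(n)     # v(t) IID ~ q(.)
    z = C @ x + D @ v              # z(t) = C x(t) + D v(t)
    w = 0.01 * rng.standard_normal(m)
    frames.append(z + w)           # y(t) = psi(z(t)) + w(t), psi = identity
    x = A @ x + B @ v              # x(t+1) = A x(t) + B v(t)

Y = np.stack(frames)               # tau x m array of synthetic "frames"
print(Y.shape)
```

Stability of A (spectral radius below one) is what makes the simulated process second-order stationary after the transient.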

The definition of dynamic texture above, which was proposed in (Doretto et al., 2003), entails a choice of filters φα, α = 1 … n. These filters are also inferred as part of the identification process for a given dynamic texture. There are several criteria for choosing a suitable class of filters, ranging from biological motivations to computational efficiency. In the trivial case, we can take φ to be the identity, and therefore look at the

4 This distribution can be inferred from the physics of the imaging device.

dynamics of individual pixels, x(t) = I(t), in (1). However, in texture analysis the dimension of the signal is huge (tens of thousands of components) and there is a lot of redundancy. Hence we view the choice of filters as a dimensionality-reduction step, and seek a decomposition of the image in the simple (linear) form

    I(t) = Σ_{i=1}^{n} xi(t) θi := C x(t)                    (2)

where C = [θ1, …, θn] and θ can be an orthonormal or overcomplete basis of L2, a set of principal components, or a wavelet filter bank. In the simplest yet effective solution, C can be estimated using linear techniques such as principal component analysis. The advantage of this approach, besides simplicity, is that it allows one to reduce complexity effectively via a data-tailored construction of basis functions. Experimental results show that 20 to 30 principal components yield synthesized textures which are practically indistinguishable from the original ones. Standard approaches based on filter banks require more coefficients to obtain comparable results; for instance, dynamic textures based on Fourier and Gabor filters have been shown in (Zhu and Wang, 2002).
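The PCA-based choice of C can be sketched with a plain SVD; the frame matrix below is synthetic and exactly rank n, so the n-component reconstruction (2) is exact here (on real video it is only approximate):

```python
import numpy as np

# Dimensionality reduction (2) by PCA: estimate C = [theta_1 ... theta_n]
# from the SVD of the frame matrix. The "video" below is synthetic and
# exactly rank n.
rng = np.random.default_rng(1)
m, tau, n = 300, 80, 5             # pixels, frames, principal components

Y = rng.standard_normal((m, n)) @ rng.standard_normal((n, tau))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
C_hat = U[:, :n]                   # orthonormal basis: C^T C = I
X_hat = np.diag(s[:n]) @ Vt[:n]    # states x(1), ..., x(tau) as columns

Y_rec = C_hat @ X_hat              # I(t) = sum_i x_i(t) theta_i = C x(t)
print(np.allclose(Y, Y_rec))
```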

2.2 Human gaits

Instead of modeling the pixel intensities, one could model the position in space of a number of landmark points, for instance the joints of an articulated body. We start from the assumption that a sequence of joint angle trajectories y(t), t = 1 … τ, is a realization of a second-order stationary stochastic process, i.e.

    x(t + 1) = A x(t) + B v(t)
    y(t)     = C x(t) + D v(t)                    (3)

where v(t) is a normalized white noise sequence. Although this may seem like a severely restrictive assumption, we show that it is sufficient to characterize models that are general enough for the purpose of recognition.

2.3 Inference

The problem of going from data to models is the usual system identification problem. Several approaches have been proposed in the literature, ranging from standard prediction error methods (Ljung, 1987), to iterative solutions based on expectation-maximization, which seem to have obtained a fair amount of attention in the learning community, to the more recent subspace methods (Overschee and Moor, 1993).

For the case of dynamic textures, due to the dimension of the signal (76,800 for video at half resolution), one cannot apply standard identification algorithms. A dimensionality-reduction step is a must, as we have discussed at the end of Section 2.1. Even after this reduction step the signal is high-dimensional, and therefore subspace methods seem to be the best way to go. If we restrict ourselves to first-order AR processes, the following algorithm yields a simple and yet effective solution. Let Y_1^τ := [y(1), …, y(τ)] ∈ R^(m×τ) with τ > n, and similarly for X_1^τ and W_1^τ, and notice that

    Y_1^τ = C X_1^τ + W_1^τ;   C ∈ R^(m×n);   C^T C = I                    (4)

by our assumptions. Now let Y_1^τ = U Σ V^T, with U ∈ R^(m×n), U^T U = I, V ∈ R^(τ×n), V^T V = I, be the singular value decomposition (SVD), with Σ = diag{σ1, …, σn}, and consider the problem of finding the best estimate of C in the sense of Frobenius: C(τ), X(τ) = arg min_{C, X_1^τ} ||W_1^τ||_F subject to (4). It follows immediately from the fixed-rank approximation property of the SVD (Golub and Loan, 1989) that the unique solution is given by

    C(τ) = U,   X(τ) = Σ V^T                    (5)

and A can be determined uniquely, again in the sense of Frobenius, by solving the following linear problem: A(τ) = arg min_A ||X_1^τ − A X_0^(τ−1)||_F, which is trivially done in closed form using the estimate of X from (5):

    A(τ) = Σ V^T D1 V (V^T D2 V)^(−1) Σ^(−1)                    (6)

where

    D1 = [   0      0 ]        D2 = [ I_(τ−1)  0 ]
         [ I_(τ−1)  0 ]             [   0      0 ].
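On synthetic data, the closed-form solution (4)-(6) amounts to a few lines of linear algebra. The sketch below uses a small stable A and an orthonormal C (both made up for illustration) and solves the least-squares problem for A with a pseudoinverse, which gives the same minimizer as the closed form (6):

```python
import numpy as np

# Closed-form identification (4)-(6) on synthetic data: C(tau) = U and
# X(tau) = Sigma V^T from the SVD of Y, then A by least squares.
# A_true and C_true are made up for illustration.
rng = np.random.default_rng(2)
m, n, tau = 50, 3, 500

A_true = np.diag([0.95, 0.8, 0.6])                     # stable first-order AR
C_true = np.linalg.qr(rng.standard_normal((m, n)))[0]  # C^T C = I

x = rng.standard_normal(n)
X = np.empty((n, tau))
for t in range(tau):
    X[:, t] = x
    x = A_true @ x + 0.01 * rng.standard_normal(n)
Y = C_true @ X                                         # noise-free for clarity

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
C_hat = U[:, :n]                                       # eq. (5)
X_hat = np.diag(s[:n]) @ Vt[:n]                        # eq. (5)
A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])   # least squares for A

# Eigenvalues of A are invariant to the choice of state-space basis,
# so they should approximately recover those of A_true.
print(np.sort(np.real(np.linalg.eigvals(A_hat))))
```

The sign ambiguity mentioned below (C determined up to a change of sign of its columns) is visible here: the columns of C_hat may differ from those of C_true by a sign, while the eigenvalues of A_hat are unaffected.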

Notice that C(τ) is uniquely determined up to a change of sign of the components of C and x. Also note that

    (1/τ) Σ_{k=1}^{τ} x(k) x^T(k) = (1/τ) Σ² ≃ E[x(t) x^T(t)]                    (7)

which is diagonal. Finally, the sample input noise covariance Q can be estimated from

    Q(τ) = (1/τ) Σ_{i=1}^{τ} v(i) v^T(i)                    (8)

where v(t) := x(t+1) − A(τ) x(t). In the algorithm above we have assumed that the number of principal components n was given. In practice, this needs to be inferred from the data; this is done from the singular values σ1, σ2, …, by choosing n as the cutoff where the value of σ drops below a threshold. A threshold can also be imposed on the difference between adjacent singular values.
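The order-selection rule and the covariance estimate (8) can be sketched as follows; the singular-value profile, the threshold value, and the scalar state trajectory are hypothetical:

```python
import numpy as np

# Choosing n from the singular values, and estimating the input-noise
# covariance as in (8). All numerical values are hypothetical.
rng = np.random.default_rng(3)

s = np.array([90.0, 55.0, 30.0, 12.0, 0.4, 0.3, 0.2, 0.1])
threshold = 1.0
n = int(np.sum(s > threshold))     # cutoff where sigma drops below threshold
print(n)

# Sample covariance (8) from an estimated state trajectory and A(tau).
A_hat = np.array([[0.9]])
X_hat = rng.standard_normal((1, 100))
V = X_hat[:, 1:] - A_hat @ X_hat[:, :-1]    # v(t) = x(t+1) - A(tau) x(t)
Q_hat = (V @ V.T) / V.shape[1]              # Q(tau) = (1/tau) sum v v^T
print(Q_hat[0, 0] > 0)
```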

Identifying a model from a sequence of 100 frames takes about 5 minutes in MATLAB on a 1 GHz Pentium III PC. Synthesis can be performed in real time. In our implementation we have used τ between 50 and 150, n between 20 and 50, and k between 10 and 30. Figures 1 and 2 show the behavior of the algorithm on a representative set of experiments. In each case of Figure 2, on the first row we show a few images from the original dataset, and on the second row we show a few extrapolated samples. Figure 1 shows the overall compression error as a function of the dimension of the state space (top row) as well as the prediction error as a function of the length of the learning set (bottom row). The prediction error is computed by using the first τ images to identify a model, then using the model to predict the image at τ + 1, and finally comparing the result with the actual image measured at τ + 1. The plot shows the error between the predicted and the measured images at τ + 1 as a function of τ. A simple histogramming of the residual shows that it deviates considerably from a Gaussian density, with considerable weight at the tails.

Fig. 1. [Plots omitted.] Compression error ||Y − Yhat|| as a function of the dimension of the state space n (top row), and extrapolation error ||y(t+1) − yhat(t+1)|| as a function of the length of the training set τ (bottom row). Column (a): river sequence; column (b): smoke sequence; column (c): toilet sequence.

Fig. 2. [Images omitted.] (a) river sequence, (b) smoke sequence, (c) toilet sequence. For each of them the top row shows samples of the original sequence, and the bottom row shows samples of the extrapolated sequence. All the data are available on-line at http://www.vision.ucla.edu/projects/dynamic-textures.html

3. CLASSIFICATION AND RECOGNITION OF DYNAMIC VISUAL PROCESSES

One approach to classification and recognition is to look directly at the data sequences. These can be regarded as samples from some distribution, and therefore classical discrepancy measures, such as the Kullback-Leibler divergence or the Bhattacharyya or Hellinger distances, could be employed. The space of probability distributions, suitably restricted, can be given the structure of a differentiable manifold. One could endow these manifolds with a Riemannian structure, and define distances and probability distributions, which are the basis of standard techniques such as Bayes classification or likelihood ratios. This, however, has several drawbacks. First, one has to take into account that different datasets may have different lengths. In fact, the probability distribution of an n-vector lives on a different space than that of an m-vector if m ≠ n. Therefore, it seems more appropriate to look at invariants of the process itself. As we are working with second-order stochastic processes, these are the spectrum, the covariance function or a spectral factor or, in other words, an ARMA model.

ARMA models, learned from data as described in the previous section, do not live on a linear space. The matrix A is constrained to be stable, and the matrix C has non-trivial geometric structure after a canonical realization is chosen (for instance, its columns may form an orthogonal set). More generally, the model lives in the quotient space of coordinate transformations of the state space. That leads us to endowing Grassmann and Stiefel manifolds with a metric and probabilistic structure, which, although possible, is not straightforward (see, however, the work of (Hannan and Deistler, n.d.) for references on how to endow transfer functions with the structure of a differentiable manifold).

While a simple probabilistic approach to classification based on likelihood ratios with respect to a simple probability distribution on Stiefel manifolds has been proposed in (Saisan et al., 2001), a rigorous treatment of this matter has yet to come. In this expository paper, we restrict ourselves to describing an approach to classification and recognition based on the metric structure of the space of models, for instance nearest-neighbor classification or k-means clustering (Duda and Hart, 1973).

Suppose a set of samples C1, C2, … is given, where each sample is labelled as belonging to one of c classes λj. Given a new sample C, the label λm is chosen by taking a vote among the k nearest samples. That is, λm is selected if the majority of the k nearest neighbors have label λm, which happens with probability

    Σ_{i=(k+1)/2}^{k} (k choose i) P(λm|C)^i (1 − P(λm|C))^(k−i).                    (9)
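The vote probability (9) is a binomial tail sum and is easy to evaluate numerically; the value p = 0.7 below is only an illustrative choice of P(λm|C):

```python
from math import comb

# Probability (9) that the majority of the k nearest neighbours carry
# label lambda_m, as a function of p = P(lambda_m | C); k is odd.
def majority_vote_prob(p: float, k: int) -> float:
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range((k + 1) // 2, k + 1))

for k in (1, 3, 5, 9):
    print(k, majority_vote_prob(0.7, k))
```

For p > 1/2 the vote probability increases with k, in line with the large-sample behavior of the k-nearest-neighbor rule noted in the text.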

It can be shown (Cover and Hart, 1967) that if k is odd, the large-sample 2-class error rate is bounded above by the smallest concave function of P* (the optimal error rate) greater than

    Σ_{i=0}^{(k−1)/2} (k choose i) ( P*^(i+1) (1 − P*)^(k−i) + P*^(k−i) (1 − P*)^(i+1) ).                    (10)

Note that the analysis holds for k fixed as n → ∞, and that the rule approaches the minimum error rate for k → ∞. For small samples, there are no known results except negative counter-examples that show that an arbitrarily bad error rate can be achieved.

Distances between identified models (which we have inferred using the implementation of the N4SID algorithm (Overschee and Moor, 1993) in the Matlab System Identification Toolbox) can be computed in a number of ways. First, we have computed the "naive" distances (the 2-norm of the difference between corresponding system matrices, without taking the geometry of the manifold into account) and the geodesic distance between models. Not surprisingly, these led to quite disappointing results. Then we have computed two metrics between observability subspaces, this time taking into account the geometry of the manifold: the Finsler distance (Weinstein, 1999) and a generalization of the Martin distance, defined in (Martin, 2000) for scalar models. We computed the principal angles between observability subspaces using the algorithm proposed in (Coch and Moor, 2000), where it is formulated for the scalar case, but it can be naturally extended to MIMO systems which are innovation models of full-rank vector processes. Then we have calculated the Finsler distance as the maximum subspace angle, and extended the Martin distance dM by using its relation to the subspace angles θi in the scalar case:

    dM = −ln Π_i cos² θi.

These two metrics gave similar results, with a slight advantage for the latter. To the distance between learned zero-mean models we added the norm of the difference between the means of the joint configurations, weighted by a scale factor whose value was set empirically.
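A simplified variant of this computation can be sketched with truncated observability matrices and off-the-shelf principal angles (here via scipy.linalg.subspace_angles). The two single-output models below are made up for illustration, and the sketch omits the inverse-model construction used in the full definition recalled further on:

```python
import numpy as np
from scipy.linalg import subspace_angles

# Simplified sketch: a Martin-type distance -ln prod_i cos^2(theta_i)
# computed from principal angles between truncated observability
# subspaces of two single-output models; (A1, C1), (A2, C2) are made up.
def observability(A, C, depth=20):
    rows = [C]
    for _ in range(depth - 1):
        rows.append(rows[-1] @ A)     # C, CA, CA^2, ...
    return np.vstack(rows)

A1, C1 = np.diag([0.90, 0.50]), np.array([[1.0, 1.0]])
A2, C2 = np.diag([0.85, 0.50]), np.array([[1.0, 1.0]])

O1 = observability(A1, C1)
O2 = observability(A2, C2)
theta = subspace_angles(O1, O2)       # principal angles between column spaces
d = -np.log(np.prod(np.cos(theta) ** 2))
print(d > 0)                          # distinct models give a positive value
```

Since cos θi ≤ 1, the quantity is non-negative, and it vanishes when a model is compared with itself.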

For the purpose of illustration, we show the pictorial result of an experiment with three classes of motions that result in similar gaits: walking, running and going up and down a staircase. Notice that these three gaits are quite similar to each other (as opposed, say, to dancing or jumping), and yet the proposed algorithm is capable of correct classification in most cases. These are admittedly preliminary results, and significant work in this area is needed. In Figure 3 we show sample frames from the training datasets. Figure 4 shows the pairwise distance between each model in the dataset. As can be seen, similar gaits result in smaller distances, with a few outliers. Although this is a very restricted database, it suffices to test our hypothesis.

We have then chosen a few sample sequences from each category as test sequences. For each of the sequences we have estimated a model by first pre-processing the sequence (after manual initialization) using the ideas described in (Bregler, 1997) to extract joint coordinates, and finally compared the models using a nearest-neighbor criterion. A sample frame from each test sequence is shown in Figure 5 (left column), while the first two corresponding nearest neighbors are shown to the right. Although this dataset is quite small, the discriminating power of the model as a representation of the dynamic sequence is visible.

Fig. 3. [Images omitted.] Sample frames from the dataset of the gaits: walking, running and walking up a staircase.

Fig. 4. [Plot omitted.] The pairwise distance between each sequence in the dataset (block rows/columns labelled Walk, Run, Stair). Each row/column of the matrix represents a sequence, and sequences corresponding to similar gaits are grouped in block rows/columns. Dark indicates a small distance, light a large distance. The minimum distance is of course along the diagonal; for each row the next closest sequence is indicated by a circle, while the second nearest is indicated by a cross.

We shall now briefly recall the basics of the distance concept introduced by Martin and elaborated upon by (Coch and Moor, 2000). Let M1 := (A1, C1, K1, Λ1) and M2 := (A2, C2, K2, Λ2) be two ARMA models; we define the extended observability matrix

    O∞(Mi) := [ Ci^T   Ai^T Ci^T   …   (Ai^T)^n Ci^T   … ]^T.

It turns out that, as shown in (Coch and Moor, 2000), the concept of distance defined by Martin can be expressed directly in terms of angles between the (extended) observability spaces. In fact, with Mi^(−1) denoting the inverse of the model Mi (recall that Mi is minimum-phase), the distance between M1 and M2 can be expressed in terms of the subspace angles between the column spaces of

    [ O∞(M1)  O∞(M2^(−1)) ]   and   [ O∞(M2)  O∞(M1^(−1)) ].

Fig. 5. For each gait we have chosen a few sample sequences (left column) and computed the distance to every other sequence in the dataset. The closest sequence is shown in the central column, while the second nearest is shown in the right column. With a few exceptions, the nearest neighbor belongs to the same gait as the test sequence. Notice that all gaits are quite similar; similar experiments performed on much more diverse gaits, such as jumping or dancing, return correct classifications.

    Test       Nearest    Second nearest
    Walk1-1    Walk2-1    Walk1-2
    Walk1-2    Walk2-2    Walk1-1
    Walk2-2    Walk2-1    Walk1-2
    Walk3-3    Walk3-1    Walk3-4
    Walk4-1    Walk4-2    Walk2-1
    Walk4-2    Walk4-1    Walk1-2
    Stair1-1   Stair2-3   Stair1-2
    Stair1-3   Stair3-1   Stair1-4
    Stair1-4   Stair1-3   Stair1-2
    Run1-1     Run3-1     Run1-2
    Run2-1     Run3-2     Run3-1
    Run3-1     Run3-2     Run2-1
    Run4-2     Run6-1     Stair3-1
    Run5-1     Run4-1     Stair3-2
    Run6-1     Run4-2     Run4-1
    Stair2-2   Stair2-3   Stair1-1
    Stair3-2   Stair3-3   Run5-1
    Stair3-3   Stair3-2   Walk4-2

If we denote by θi the i-th canonical angle between these spaces, the distance defined by Martin is given by

    dM(M1, M2)² = ln Π_{i=1}^{2n} 1 / cos² θi,                    (11)

where n is the model order. Although this is true for scalar ARMA processes, one can measure the distance between multivariable ARMA models using the same concept. In this case the link with the cepstrum is clearly lost. In order to compute the subspace angles between models, we proceed as follows:

• Compute the solution Q = ( Q11  Q12 ; Q21  Q22 ) of the Lyapunov equation

      A^T Q A − Q = −C^T C

  where A is block-diagonal,

      A = diag( A1,  A2 − K2 C2,  A2,  A1 − K1 C1 ),

  and

      C := ( C1   −C2   C2   −C1 ).

• Compute the 2n largest eigenvalues λ1, …, λ2n of

      (       0          Q11^(−1) Q12 )
      ( Q22^(−1) Q21          0       ).

• Set λi = cos θi.
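The three steps above can be carried out with standard linear-algebra routines; the two first-order innovation models below are made up for illustration, and we use the fact that scipy's solve_discrete_lyapunov(A^T, C^T C) solves exactly A^T Q A − Q = −C^T C:

```python
import numpy as np
from scipy.linalg import block_diag, solve_discrete_lyapunov

# Subspace angles between two illustrative first-order innovation models
# M1 = (A1, C1, K1) and M2 = (A2, C2, K2); all values are made up.
n = 1
A1, C1, K1 = np.array([[0.9]]), np.array([[1.0]]), np.array([[0.1]])
A2, C2, K2 = np.array([[0.7]]), np.array([[1.0]]), np.array([[0.2]])

A = block_diag(A1, A2 - K2 @ C2, A2, A1 - K1 @ C1)
C = np.hstack([C1, -C2, C2, -C1])

# Step 1: solve A^T Q A - Q = -C^T C (discrete Lyapunov equation).
Q = solve_discrete_lyapunov(A.T, C.T @ C)
Q11, Q12 = Q[:2*n, :2*n], Q[:2*n, 2*n:]
Q21, Q22 = Q[2*n:, :2*n], Q[2*n:, 2*n:]

# Step 2: the 2n largest eigenvalues of the coupled matrix.
M = np.block([[np.zeros((2*n, 2*n)), np.linalg.solve(Q11, Q12)],
              [np.linalg.solve(Q22, Q21), np.zeros((2*n, 2*n))]])
lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1][:2*n]

# Step 3: lambda_i = cos(theta_i).
theta = np.arccos(np.clip(lam, -1.0, 1.0))
print(theta)
```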

We would like to stress that the distance just defined is independent of the basis chosen in the state space. The only requirement is that the model be in innovation canonical form, which is guaranteed by construction in our representation.

It is also worth emphasizing that, as we have anticipated, the innovation variance Λi does not enter the distance computation. This implies that data which differ just by the innovation variance will be treated as deriving from the same model, at least as far as classification/recognition is concerned.

4. SEGMENTATION OF DYNAMIC VISUAL PROCESSES

In modeling the spatio-temporal statistics of a process, one is always faced with the conundrum of whether to use a complex model that captures the global statistics, or to choose a simple class of models and then partition the scene into regions where the model fits the data within a specified accuracy. In the next section, we discuss a simple model for partitioning the scene into regions where the spatio-temporal statistics, represented by a model identified as described above, are constant.

4.1 Spatial segmentation

In this section, which follows (Cremers et al., 2003), we discuss the problem of partitioning a sequence of images into regions that can be described by an ARMA model. Let Ωi ⊂ R², i = 1, …, N, be a partition of the image into N (unknown) regions 5. We assume that each pixel contained in the region Ωi is a Gauss-Markov process. In particular, we assume that there exist (unknown) parameters Ai ∈ R^(n×n), Ci ∈ R^(mi×n), covariance matrices Qi ∈ R^(n×n), Ri ∈ R^(mi×mi), white, zero-mean Gaussian processes v(t) ∈ R^n, w(t) ∈ R^(mi), and a process x(t) ∈ R^n such that the pixels in each region at each instant of time, y(t), obey a model of the type (3). Note that we allow the number of pixels mi to be different in each region, as long as Σ_{i=1}^{N} mi = m, the size of the entire image, and that we require that neither the regions nor the parameters change over time: Ωi, Ai, Ci, Qi, Ri, xi,0 = const.

5 That is, Ω = ∪_{i=1}^{N} Ωi and Ωi ∩ Ωj = ∅, i ≠ j.

Given this generative model, one way to formalize the problem of segmenting a sequence of images is the following: given a sequence of images y(t) ∈ R^m, t = 1, …, T, with two or more distinct regions Ωi, i = 1, …, N ≥ 2, that satisfy the model (3), estimate the regions Ωi, the unknown state x(t) in each region, and the "signature" of each region, namely the parameters Ai, Ci, the initial state xi,0 and the covariance of the driving process Qi (the covariance Ri is uninformative and therefore excluded from the signature). Assuming that the parameters Ai, Ci, Qi, xi,0 have been inferred for each region, in order to set the stage for a segmentation procedure one has to define a discrepancy measure among regions. This has been discussed in the previous section. Now, if the boundaries of each region were known, one could easily estimate a simple model of the spatio-temporal statistics within each region. Unfortunately, in general one does not know the boundaries of each region, and finding them is one of the goals of the inference process. On the other hand, if the dynamic signature associated with each pixel were known, then one could easily determine the regions by thresholding or by other grouping or segmentation techniques. Unfortunately, the model we wish to infer is not a point process, and therefore one cannot pre-compute the signature of each pixel and convert the problem into a static segmentation at the outset. Therefore, we have a classic "chicken-and-egg" problem: if we knew the regions, we could easily identify the dynamical models, and if we knew the dynamical models, we could easily segment the regions. Unfortunately, we know neither.

In order to address this problem, one can set up an alternating minimization procedure: start with an initial guess of the regions Ωi(0), estimate the models Ai(0), Ci(0), Qi(0), xi,0 within each region, and then at each iteration t seek a modification of the regions Ωi(t) and an update of the models Ai(t), Ci(t), Qi(t), xi,0 that minimize a chosen cost functional. For instance, one can minimize the norm of the innovation, integrated in space and time. For the sake of example, in the case of two regions Ω1, Ω2, one could seek the Ωi(t + 1) that solve the following optimization problem: 6

\[
\arg\min_{\Omega_i} \; \sum_{i=1,2} \int_{\Omega_i} \sum_{k=1}^{T} \left\| y(\xi, k) - y_i(\xi, k \mid k-1) \right\| d\xi \tag{12}
\]

where y_i(ξ, k|k−1) is the predictor of y(ξ, k) based on model i. Once the partition has been estimated, one can update A_i(t+1), C_i(t+1), Q_i(t+1), x_0(t+1) using, for instance, subspace identification techniques.
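As an illustration only, the alternation above can be sketched in a drastically simplified setting: scalar pixel intensities, first-order autoregressive models in place of the full state-space models (3), and least-squares fits in place of subspace identification. All function and variable names below are ours, not from the paper:

```python
import numpy as np

def fit_ar1(z):
    """Least-squares AR(1) gain a in z[k] ~ a*z[k-1] for a 1-D signal."""
    return float(z[:-1] @ z[1:] / (z[:-1] @ z[:-1] + 1e-12))

def prediction_error(z, a):
    """Integrated squared one-step innovation of z under gain a."""
    return float(np.sum((z[1:] - a * z[:-1]) ** 2))

def alternate_segmentation(Y, labels, n_iter=10):
    """Y: (T, n_pixels); labels: initial 0/1 region guess per pixel.
    Alternates (i) fitting one AR(1) model per region and (ii) relabeling
    each pixel by minimum innovation norm, in the spirit of Eq. (12)."""
    labels = labels.copy()
    for _ in range(n_iter):
        a = [fit_ar1(Y[:, labels == i].mean(axis=1))
             if np.any(labels == i) else 0.0 for i in (0, 1)]
        new = np.array([np.argmin([prediction_error(Y[:, p], a[i])
                                   for i in (0, 1)])
                        for p in range(Y.shape[1])])
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, a
```

The innovation-norm relabeling plays the role of the region update Ωi(t + 1); a real implementation would use the full models Ai, Ci, Qi and a Kalman predictor in place of the AR(1) surrogate.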

6 We use the notation y(ξ, t) to indicate the value of a pixel at location ξ ∈ Ω at time t.

Under the assumption that each pixel in a region obeys the same dynamical model, the minimum of the corresponding functional is attained when the distance between the two models (one for region Ω1, the other for its complement Ω2) is maximized. Therefore, it is tempting to formulate the problem as one of simultaneously finding the regions Ωi and the models Mi by maximizing the distance (11), subject to the constraint that each model Mi is identified from data in region Ωi. A suboptimal approach for this task has been proposed in (Cremers et al., 2003).

For each pixel ξ we generate a local spatio-temporal signature given by the cosines of the subspace angles θj(ξ) between Mξ and a reference model Mξ0:

\[
s(\xi) = \left( \cos\theta_1(\xi), \dots, \cos\theta_n(\xi) \right). \tag{13}
\]
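A minimal sketch of how such a signature could be computed, assuming the models are compared through the principal angles between the ranges of their extended observability matrices (a simplified version of the De Cock–De Moor construction; the truncation length k is our assumption):

```python
import numpy as np

def observability(A, C, k):
    """Extended observability matrix [C; CA; ...; CA^(k-1)]."""
    blocks, M = [], C
    for _ in range(k):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def subspace_angle_cosines(A1, C1, A2, C2, k=10):
    """Cosines of the principal angles between the ranges of the
    extended observability matrices of two linear models."""
    Q1, _ = np.linalg.qr(observability(A1, C1, k))
    Q2, _ = np.linalg.qr(observability(A2, C2, k))
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)
```

Identical models yield all cosines equal to one; the signature (13) collects these cosines against a fixed reference model.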

With the above representation, the problem of dynamic texture segmentation can be formulated as one of grouping regions of similar spatio-temporal signature. We propose to perform this grouping by reverting to the Mumford-Shah functional (Mumford and Shah, 1989). A segmentation of the image plane Ω into a set of pairwise disjoint regions Ωi of constant signature si ∈ Rn is obtained by minimizing the cost functional

\[
E(\Gamma, s_i) = \sum_i \int_{\Omega_i} \big( s(\xi) - s_i \big)^2 d\xi + \nu\, |\Gamma|, \tag{14}
\]

simultaneously with respect to the region descriptors si, modeling the average signature of each region, and with respect to the boundary Γ separating these regions (an appropriate representation of which will be introduced in the next section). The first term in the functional (14) aims at maximizing the homogeneity of the signatures in each region Ωi, whereas the second term aims at minimizing the length |Γ| of the separating boundary.

Let the boundary Γ in (14) be given by the zero level set of a function φ : Ω → R:

\[
\Gamma = \{ \xi \in \Omega \mid \phi(\xi) = 0 \}. \tag{15}
\]

With the Heaviside function

\[
H(\phi) = \begin{cases} 1 & \text{if } \phi \ge 0 \\ 0 & \text{if } \phi < 0, \end{cases} \tag{16}
\]

the functional (14) can be replaced by a functional on the level set function φ:

\[
E(\phi, s_i) = \int_{\Omega} \big( s(\xi) - s_1 \big)^2 H(\phi)\, d\xi
+ \int_{\Omega} \big( s(\xi) - s_2 \big)^2 \big( 1 - H(\phi) \big)\, d\xi
+ \nu\, |\Gamma|. \tag{17}
\]

We minimize the functional (17) by alternating the two fractional steps of:

• Estimating the mean signatures. For fixed φ, minimization with respect to the region signatures s1 and s2 amounts to averaging the signatures over each phase:

\[
s_1 = \frac{\int_{\Omega} s\, H(\phi)\, d\xi}{\int_{\Omega} H(\phi)\, d\xi}, \qquad
s_2 = \frac{\int_{\Omega} s\, \big(1 - H(\phi)\big)\, d\xi}{\int_{\Omega} \big(1 - H(\phi)\big)\, d\xi}. \tag{18}
\]

• Boundary evolution. For fixed region signatures si, minimization with respect to the embedding function φ can be implemented by a gradient descent given by:

\[
\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ \nu\, \mathrm{div}\!\left( \frac{\nabla \phi}{|\nabla \phi|} \right) + (s - s_2)^2 - (s - s_1)^2 \right].
\]
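The two fractional steps can be sketched numerically in one spatial dimension, with a smoothed Heaviside and, for simplicity, the curvature term dropped (ν = 0); this is an illustrative toy of ours, not the implementation of (Cremers et al., 2003):

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smoothed Heaviside H_eps and its derivative delta_eps."""
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))
    d = (eps / np.pi) / (eps ** 2 + phi ** 2)
    return H, d

def evolve(s, phi, n_steps=200, dt=1.0):
    """Alternates Eq. (18) (region means) with the gradient flow on phi
    for a 1-D signature field s; the curvature term (nu = 0) is omitted."""
    for _ in range(n_steps):
        H, d = heaviside(phi)
        s1 = (s * H).sum() / (H.sum() + 1e-12)          # mean where phi >= 0
        s2 = (s * (1 - H)).sum() / ((1 - H).sum() + 1e-12)
        phi = phi + dt * d * ((s - s2) ** 2 - (s - s1) ** 2)
    return phi, s1, s2
```

On a piecewise-constant signature the zero level set of φ migrates to the discontinuity, and s1, s2 converge to the two region means.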

In Figure 6 we show a few snapshots of the contour evolution, starting from a circle. Notice that the final contour is the contour of an "average" region obtained by combining the different regions in time. Therefore, our approach also shows robustness to violations of the original hypothesis that dynamic textures are spatially stationary.

4.2 Temporal segmentation: filtering and identification of hybrid systems

In addition to partitioning the spatial domain into regions of constant statistics, one can partition the temporal domain, thereby segmenting a continuous process into discrete "events". For instance, the trajectory of joint positions and angles can be partitioned into segments each corresponding to a particular gait, so that we can detect when a walking person begins running or limping.

This is closely related to filtering and identification of hybrid systems, and in particular Jump Linear Systems. We have addressed some of the issues related to identifying and filtering linear systems of this kind in (Vidal et al., 2003a; Vidal et al., 2002; Vidal et al., 2003b), where one can also find a formalization of the problem. We use the techniques developed there to detect spatio-temporal events from live video, for instance the inception of a fire, an explosion, or the transition from walking to running of a subject in the field of view.
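As a toy illustration of temporal event detection (not the algebraic methods of the cited works), one can fit a model on an initial window and flag the first time the one-step innovation grows abnormally; the function name and thresholds below are our own assumptions:

```python
import numpy as np

def detect_switch(y, win=30, thresh=4.0):
    """Flag the first time the one-step innovation under an AR(1) model
    fitted on an initial window exceeds `thresh` times its in-window RMS,
    a crude surrogate for jump-linear change detection."""
    z = y[:win]
    a = float(z[:-1] @ z[1:] / (z[:-1] @ z[:-1] + 1e-12))
    e0 = np.sqrt(np.mean((z[1:] - a * z[:-1]) ** 2)) + 1e-12
    for t in range(win, len(y)):
        if abs(y[t] - a * y[t - 1]) > thresh * e0:
            return t
    return None
```

A principled treatment would instead estimate the discrete mode jointly with the continuous state, as formalized in (Vidal et al., 2002; 2003a; 2003b).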

5. EXTENSIONS

All that we have discussed above pertains to systems that are either linear or piecewise linear. However, as we have anticipated, if we identify a model within a linear segment and then plot a histogram of the residual, it is far from Gaussian. An optimality criterion for inference of model and

Fig. 6. In this experiment, courtesy of (Cremers et al., 2003), we show the segmentation of scenes based on the dynamic content of different regions. The last example is particularly challenging since the regions change over time. Animations of these results can be downloaded from http://www.cs.ucla.edu/~doretto/projects/dynamic-segmentation.html.

input descriptions can be constructed by requiring the estimated input sequence v(t) to be a realization from a stochastic process that has maximally independent components. Independence can be expressed in terms of the mutual information among input components, which in turn can be written in terms of the Kullback-Leibler divergence. This approach results in a semi-parametric statistical inference problem, where one has to simultaneously infer the (finite-dimensional) model parameters as well as the (infinite-dimensional) input distribution q. This is essentially an independent component analysis (ICA) problem.

In its conventional static form, ICA attempts to decompose a random vector into a linear combination of statistically independent components. If we call y ∈ Rm the random vector, then ICA looks for a matrix C ∈ Rm×n with n ≤ m and a random vector x ∈ Rn with independent components, px(x1, . . . , xn) = p1(x1) · · · pn(xn), such that

y = Cx. (19)
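A numerical sketch of this static decomposition, using a kurtosis-based FastICA iteration (one common estimator) in place of direct minimization of the mutual information; the function name and the two-component restriction are ours:

```python
import numpy as np

def fastica_2d(Y, n_iter=100):
    """Tiny symmetric FastICA (kurtosis contrast) recovering two
    independent components from linear mixtures Y of shape (2, T)."""
    Y = Y - Y.mean(axis=1, keepdims=True)
    # Whiten: Z has identity covariance.
    D, E = np.linalg.eigh(np.cov(Y))
    Z = E @ np.diag(D ** -0.5) @ E.T @ Y
    W = np.linalg.qr(np.random.randn(2, 2))[0]
    for _ in range(n_iter):
        G = (W @ Z) ** 3                      # contrast g(u) = u^3
        W_new = G @ Z.T / Z.shape[1] - 3 * W  # fixed-point update
        U, _, Vt = np.linalg.svd(W_new)       # symmetric decorrelation
        W = U @ Vt
    return W @ Z
```

The recovered components match the sources up to the usual permutation, sign, and scale ambiguities of ICA.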

The unknowns C and p_i can be estimated by minimizing the mutual information

\[
I(y \,\|\, Cx) \doteq \int p_y \log \frac{p_y}{p_{Cx}}\, dy,
\]

computed or approximated using a number of independent and identically distributed (IID) samples from p_y: y(1), . . . , y(t) ∼ p_y. Typically the process y is assumed to be ergodic, and therefore a time series is used in lieu of a fair sample. What we have here, however, is a dynamic ICA problem of separating independent

components mixed by linear dynamical (state-space) systems.

Let us rewrite the output of the model at time t:

\[
y(t) = \begin{bmatrix} CA^{t} & CA^{t-1}B & \dots & CB \end{bmatrix}
\begin{bmatrix} x(0) \\ v(0) \\ \vdots \\ v(t-1) \end{bmatrix}
\doteq \mathcal{C}_t V \tag{20}
\]

and stack the observations y(1), . . . , y(t) into a vector Y_t to obtain

\[
Y_t = \mathcal{C}_t V. \tag{21}
\]

One may be tempted to invoke the independence of the components of V (based on the assumptions that v(t) is white, i.e. time samples are independent, and has independent components) and use standard ICA to estimate the mixing matrix C_t. This, however, does not work, because it is not possible to use time realizations as independent samples of Y due to the initial condition x(0). One may conjecture that if t is large enough and A is stable, the effect of the initial condition will wane; the assumption may therefore not be too restrictive. Under this assumption, the problem of dynamic ICA can be posed as follows. Consider Y_t(k) = [y((k−1)t)^T, . . . , y(kt−1)^T]^T, and similarly for V_t(k). Furthermore, let C_t be the matrix obtained by completing, in the sense of Toeplitz, the following matrix

\[
\begin{bmatrix}
CB & & & \\
CAB & CB & & \\
\vdots & \vdots & \ddots & \\
CA^{t-1}B & CA^{t-2}B & \dots & CB
\end{bmatrix}. \tag{22}
\]

Then A, B, C can be found suboptimally by first estimating the mixing matrix C_t, which has the particular structure above, from a set of independent samples Y_t(1), . . . , Y_t(k) (notice that Y_t(i) and Y_t(j) do not share components y(k)):

\[
\mathcal{C}_t(A, B, C) = \arg\min_{\mathcal{C}_t} I\big( Y_t(i) \,\big\|\, \mathcal{C}_t V_t(i) \big). \tag{23}
\]

A suboptimal algorithm for identification based on maximization of input independence has been proposed in (Bissacco and Saisan, 2002). It is based on Amari's natural gradient flow for semi-parametric statistical problems (Amari and Cardoso, 1997). Sampling techniques can also be used to perform inference, in a particle filtering framework.
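The block structure of (20)-(22) can be sketched as follows, assuming zero initial condition so that the stacked outputs are exactly the Toeplitz mixing matrix times the stacked inputs (function names are ours):

```python
import numpy as np

def stack_blocks(y, t):
    """Partition a time series y (m, N) into non-overlapping stacked
    blocks Y_t(k) = [y((k-1)t)^T, ..., y(kt-1)^T]^T, used as the
    'independent samples' of the dynamic ICA problem."""
    m, N = y.shape
    K = N // t
    return np.stack([y[:, k * t:(k + 1) * t].T.reshape(-1)
                     for k in range(K)], axis=1)

def toeplitz_mixing(A, B, C, t):
    """Block lower-triangular Toeplitz matrix of Eq. (22), mapping
    stacked inputs V_t to stacked outputs under x(0) = 0."""
    m, n, p = C.shape[0], A.shape[0], B.shape[1]
    blocks, Ak = [], np.eye(n)
    for _ in range(t):
        blocks.append(C @ Ak @ B)   # CB, CAB, CA^2 B, ...
        Ak = A @ Ak
    M = np.zeros((m * t, p * t))
    for i in range(t):
        for j in range(i + 1):
            M[i * m:(i + 1) * m, j * p:(j + 1) * p] = blocks[i - j]
    return M
```

Estimating this structured matrix from the independent block samples is precisely the constrained ICA problem (23).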

6. DISCUSSION

We have presented a handful of examples where current algorithms for system identification can be successfully employed to address modeling, synthesis and recognition problems in computer vision. These include modeling dynamic textures and human gaits for the purpose of synthesis and classification or recognition. In addition, we have indicated several directions where further work in system identification is needed in order to address the difficult tasks of modeling non-Gaussian, non-linear, non-stationary processes for detection, classification, recognition and segmentation.

Acknowledgments

We wish to thank Alessandro Bissacco, Daniel Cremers, Gianfranco Doretto, Paolo Favaro, Payam Saisan, and Ying-Nian Wu for discussions and experimental material used in this paper.

REFERENCES

Amari, S. and J.-F. Cardoso (1997). Blind source separation – semiparametric statistical approach. IEEE Trans. Signal Processing 45(11), 2692–2700.

Bissacco, A. and P. Saisan (2002). Modeling human gaits with subtleties. In: Workshop on Dynamic Scene Analysis.

Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. In: Proc. of the Conference on Computer Vision and Pattern Recognition. pp. 568–574.

Chiuso, A., R. Brockett and S. Soatto (2000). Optimal structure from motion: local ambiguities and global estimates. Intl. J. of Computer Vision 39(3), 195–228.

Cover, T. and P. Hart (1967). Nearest neighbor pattern classification. IEEE Trans. on Information Theory 13, 21–27.

Cremers, D., G. Doretto, P. Favaro and S. Soatto (2003). Dynamic texture segmentation. UCLA Tech. Report CSD-TR030014.

De Cock, K. and B. De Moor (2000). Subspace angles and distances between ARMA models. In: Proc. of the Intl. Symp. on Math. Theory of Networks and Systems.

Doretto, G., A. Chiuso, Y. Wu and S. Soatto (2003). Dynamic textures. Intl. J. of Comp. Vis. 51(2), 91–109.

Duda, R. O. and P. E. Hart (1973). Pattern classification and scene analysis. Wiley and Sons.

Felleman, D. J. and D. C. van Essen (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1, 1–47.

Golub, G. and C. Van Loan (1989). Matrix computations. 2nd ed. Johns Hopkins University Press.

Hannan, E. J. and M. Deistler (1988). The statistical theory of linear systems. Wiley and Sons.

Ljung, L. (1987). System identification: theory for the user. Prentice Hall.

Ma, Y., S. Soatto, J. Kosecka and S. Sastry (2003). An invitation to 3-D vision: from images to geometric models. Springer Verlag.

Martin, R. (2000). A metric for ARMA processes. IEEE Trans. on Signal Processing 48(4), 1164–1170.

Mumford, D. and J. Shah (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Comm. on Pure and Applied Mathematics 42, 577–685.

Overschee, P. Van and B. De Moor (1993). Subspace algorithms for the stochastic identification problem. Automatica 29, 649–660.

Saisan, P., G. Doretto, Y. Wu and S. Soatto (2001). Dynamic texture recognition. In: Proc. IEEE Conf. on Comp. Vision and Pattern Recogn.. pp. II-58–63.

Vidal, R., A. Chiuso and S. Soatto (2002). Observability and identifiability of jump linear systems. In: Proc. of the Intl. Conf. on Decision and Control.

Vidal, R., A. Chiuso, S. Soatto and S. Sastry (2003a). Observability of linear hybrid systems. In: Proc. of Hybrid Systems: Computation and Control.

Vidal, R., S. Soatto and S. Sastry (2003b). An algebraic geometric approach to the identification of linear hybrid systems. In: IEEE Conf. on Decision and Control. (submitted).

Weinstein, A. (1999). Almost invariant submanifolds for compact group actions.

Zhu, S. C. and Y. Z. Wang (2002). A generative method for textured motion: analysis and synthesis. In: Proc. of the Eur. Conf. on Comp. Vision.