Bilinear models and Riemannian metrics for
motion classification
Fabio Cuzzolin
Microsoft Research, Cambridge, UK
11/7/2006
Myself
- Master's thesis on gesture recognition at the University of Padova
- Visiting student, ESSRL, Washington University in St. Louis
- Ph.D. thesis on the theory of belief functions
- Young researcher in Milan with the Image and Sound Processing group
- Post-doc at UCLA in the Vision Lab
My research
- Discrete mathematics: linear independence on lattices
- Belief functions and imprecise probabilities: geometric approach, algebraic analysis, combinatorial analysis
- Computer vision: object and body tracking, data association, gesture and action recognition, identity recognition
Today's talk
- Motion classification is one of the most popular vision problems
- Applications: surveillance, biometrics, human-computer interaction
- Issue: influence of nuisance factors -> bilinear models for invariant gaitID
- Issue: choice of the distance function -> learning Riemannian metrics for motion classification
Outline
Bilinear models for invariant gaitID: the identity recognition problem, view-invariance in gaitID, bilinear models, HMMs and a three-layer model, four experiments on the Mobo database
Riemannian metrics for classification: distances between dynamical models, learning a metric from a training set, pullback metrics, spaces of linear systems and the Fisher metric, experiments on scalar models
GaitID
- Biometrics are increasingly popular
- Cooperative methods: face recognition, retinal analysis
- Surveillance context: non-cooperative users
- The problem: recognizing the identity of humans from their gait
- Methods: dimensionality reduction, silhouette analysis
- Issues: nuisance factors, viewpoint dependence
A brief review
Gait signatures:
- silhouettes [Collins 02, Wang 03]
- optical flow, velocity moments, shape symmetry, static body parameters
"Baseline" algorithm [Sarkar 05]: computes similarity scores between a probe sequence and each gallery (training) sequence by pairwise frame correlation
Methodologies: mostly pattern recognition after dimensionality reduction:
- eigenspaces [Abdelkader 01], PCA/MDA [Tolliver 03, Han 04]
- stochastic models (HMMs) [Kale 02, Debrunner 00], KL-divergence between Markov models
The view-invariance issue
- Many different nuisance factors are involved: viewpoint, illumination, clothes, shoes, carried objects, trajectory
- Issue: view-invariance; possible approaches: 3D tracking, virtual view reconstruction, static body parameters
Approaches to view-invariant gaitID
- [Cunado 99]: "evidence gathering" technique: coupled oscillators, Fourier description, inclination of thigh and leg
- [Urtasun, Fua 04]: fitting 3D temporal motion models to synchronized video sequences; motion parameters: coefficients of the singular value decomposition of the estimated model angles
- [Bhanu, Han 02]: matching a 3D kinematic model to 2D silhouettes, then extracting a number of feature angles from the fitted model
- [Kale 03]: synthetic side view of the moving person using a single camera
- [Shakhnarovich 01]: view normalization from volumetric intersection of the visual hulls
- [Johnson, Bobick 01]: static body parameters recovered across multiple views
Bilinear models
- From view-invariance to "style" invariance: motions usually possess several labels: action, identity, viewpoint, emotional state, etc.
- Bilinear models (Tenenbaum) can be used to separate the influence of two of those factors, called "style" and "content" (the label to classify)
- y^{sc} is a training set of k-dimensional observations with labels s and c
- b^c is a parameter vector representing content, while A^s is a style-specific linear map from the content space onto the observation space:

y^{sc} = A^s b^c
Bilinear models
The "content" of an observation can be thought of as a vector b^c in an abstract "content space" of some dimension J. Observations y^{sc} are then derived from the content vector linearly, through a map A^s which depends on the "style" parameter s.
Learning an asymmetric bilinear model
Given a training set of observations y^{sc}, an asymmetric bilinear model can be fitted to the data through the SVD Y = U S V' of the stacked observation matrix

Y = \begin{pmatrix} y^{11} & \cdots & y^{1C} \\ \vdots & & \vdots \\ y^{S1} & \cdots & y^{SC} \end{pmatrix}

The asymmetric model can be written as Y = AB, where A = [A^1, ..., A^S]' stacks the style matrices and B = [b^1, ..., b^C] collects the content vectors. The least-squares optimal style and content parameters are

A = [U S]_{\text{columns } 1..J}, \quad B = [V']_{\text{rows } 1..J}
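As an illustration, here is a minimal numpy sketch of this SVD-based fit. The block layout of Y_blocks, the toy dimensions, and the function name fit_asymmetric are choices made for this example, not code from the original work.

```python
import numpy as np

def fit_asymmetric(Y_blocks, J):
    """Y_blocks: (S, C, k) array of k-dim observations y^{sc}; J = content dimension."""
    S, C, k = Y_blocks.shape
    # Stack observations into the (S*k) x C matrix Y, one block row per style.
    Y = Y_blocks.transpose(0, 2, 1).reshape(S * k, C)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = (U * s)[:, :J]   # stacked style matrices [A^1; ...; A^S]
    B = Vt[:J, :]        # content vectors b^1..b^C as columns
    return A, B

# Usage: A.reshape(S, k, J)[s] recovers the style map A^s, and
# A.reshape(S, k, J)[s] @ B[:, c] approximates the observation y^{sc}.
rng = np.random.default_rng(0)
Yb = rng.normal(size=(5, 4, 30))   # toy data: 5 styles, 4 contents, k = 30
A, B = fit_asymmetric(Yb, J=3)
```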
Content classification of unknown style
- Consider a training set in which persons (content = ID) are seen walking from different viewpoints (style = viewpoint)
- When new motions are acquired in which a known person is walking from a different viewpoint (unknown style)...
- ... an iterative EM procedure can be set up to classify the content (identity)
- E step: estimation of p(c | s̃), the probability of the content given the current estimate s̃ of the style
- M step: estimation of the linear map A^{s̃} for the unknown style s̃, under the Gaussian observation model

p(y \mid \tilde s, c) \propto \exp\left( -\| y - A^{\tilde s} b^c \|^2 / 2\sigma^2 \right)
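A possible implementation of this EM loop, under the Gaussian model above, might look as follows. The uniform content prior, the fixed variance sigma2, the random initialization, and the weighted least-squares form of the M step are assumptions of this sketch, not details taken from the talk.

```python
import numpy as np

def classify_content(y, B, sigma2=1.0, n_iter=20, seed=0):
    """y: k-dim test observation; B: (J, C) matrix of learned content vectors."""
    k, (J, C) = y.shape[0], B.shape
    rng = np.random.default_rng(seed)
    A_tilde = rng.normal(size=(k, J))          # init map for the unknown style
    for _ in range(n_iter):
        # E step: p(c | s~, y) from the Gaussian likelihoods (uniform prior on c).
        sq = np.sum((y[:, None] - A_tilde @ B) ** 2, axis=0)
        logp = -sq / (2 * sigma2)
        p = np.exp(logp - logp.max())
        p /= p.sum()
        # M step: weighted least-squares refit of the style matrix
        # (pinv because a single observation makes the system rank-deficient).
        M = (B * p) @ B.T                      # sum_c p_c b^c (b^c)'
        A_tilde = np.outer(y, B @ p) @ np.linalg.pinv(M)
    return int(np.argmax(p)), A_tilde          # most likely content (identity)
```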
Hidden Markov models
- Finite-state representation of an observation process; the state process {X_k} is a Markov chain
- Given a sequence of observations (feature matrix)... EM algorithm for parameter learning (Moore)
- A -> transition probabilities (motion dynamics); C -> means of the state-output distributions (poses)
Motions as stacked HMMs
- Interpretation of the C matrix: its columns are the means of the output distributions associated with the states of the model
- In gaitID (cyclic motions) the dynamics is the same for all sequences, so A is neglected
- A sequence can then be represented as a collection of poses: the stacked columns of its C matrix
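For instance, assuming the hmmlearn package for the EM learning step, this second-layer encoding could be sketched like this; the number of states and the crude state-reordering trick are illustrative choices.

```python
import numpy as np
from hmmlearn import hmm   # assumed available; any EM-based HMM fit would do

N_STATES = 4               # arbitrary choice for illustration

def sequence_to_vector(X):
    """X: (n_frames, n_features) feature matrix of one sequence."""
    model = hmm.GaussianHMM(n_components=N_STATES, covariance_type="diag",
                            n_iter=50, random_state=0)
    model.fit(X)                       # EM learning of A and C
    C = model.means_                   # rows = mean poses, one per state
    # State labels are arbitrary: fix an ordering so vectors are comparable.
    order = np.argsort(C[:, 0])
    return C[order].reshape(-1)        # stacked observation vector
```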
Three-layer model
1. First layer (feature representation): projection of the contour of the silhouette onto a sheaf of lines passing through its center
2. Second layer: each sequence is encoded as a Markov model, and its C matrix is stacked into an observation vector
3. Third layer: a bilinear model is trained over those vectors
MOBO database
- 25 people performing 4 different walking actions, viewed from 6 cameras
- Each sequence has three labels: action, ID, view
Four experiments
We can then set up four experiments in which one label is chosen as content, another one as style, and the remaining one is considered a nuisance factor:

experiment                        | content | style  | nuisance
view-invariant action recognition | action  | view   | ID
ID-invariant action recognition   | action  | ID     | view
action-invariant gaitID           | ID      | action | view
view-invariant gaitID             | ID      | view   | action
Results – ID versus VIEW
Compared performances with the "baseline" algorithm and straight k-NN on sequence HMMs
Results – ID versus action
Performance of the bilinear classifier in the ID vs action experiment as a function of the nuisance (view = 1:5), averaged over all possible choices of the test action. The average best-match performance of the bilinear classifier is shown in solid red (minimum and maximum in magenta); the best-3-matches ratio is in dotted red. The average performance of the KL nearest-neighbor classifier is shown in solid black (minimum and maximum in blue). Pure chance is in dashed black.
Feature extraction
- Type 1: projection of the contour of the silhouette onto a sheaf of lines passing through the center
- Type 2: size functions [Frosini 90]
- Type 3: Lee's moments
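Under one plausible reading of the Type 1 feature (recording the extent of the contour along each line of the sheaf), a sketch might look like this; the number of lines L and the use of the centroid as center are assumptions of this example.

```python
import numpy as np

def contour_projection(contour, L=16):
    """contour: (n_points, 2) array of silhouette boundary points."""
    center = contour.mean(axis=0)               # center of the sheaf
    d = contour - center
    angles = np.pi * np.arange(L) / L           # L line directions
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    proj = d @ dirs.T                           # signed projections onto each line
    return proj.max(axis=0) - proj.min(axis=0)  # extent of the contour per line
```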
Results - influence of features
Left: ID-invariant action recognition using the bilinear classifier. The entire dataset is considered, regardless of the viewpoint. The correct-classification percentage is shown as a function of the test identity, in black for models using Lee's features and in red for contour projections; the related mean levels are drawn as dotted lines. Right: view-invariant action recognition.
Conclusions
- Nuisance factors are of paramount importance in gaitID
- Bilinear and multilinear models provide a way to separate different factors
- Proposed a three-layer model in which sequences are represented through HMMs
- Some approaches to view-invariance are expensive and sensitive
- Experiments on the Mobo database show how effective separating the factors is for motion classification
- Future: multilinear models, testing on more realistic setups (many factors, UCF database)
Riemannian metrics for classification
Distances between dynamical models, learning a metric from a training set, pullback metrics, spaces of linear systems and the Fisher metric, experiments on scalar models
Distances between dynamical models
- Problem: motion classification
- Approach: representing each movement as a linear dynamical model; for instance, each image sequence can be mapped to an ARMA, or AR, linear model
- Classification is then reduced to finding a suitable distance function in the space of dynamical models
- We can then use this distance in any distance-based classification scheme: k-NN, SVM, etc.
A review of the literature
Some distances have been proposed:
- A family of probability distributions depending on an n-dimensional parameter can in fact be regarded as an n-dimensional manifold, with the Fisher information matrix as metric tensor [Amari]:

g_{ij}(\theta) = E\left[ \partial_i \log p(x; \theta) \, \partial_j \log p(x; \theta) \right]

- Kullback-Leibler divergence
- Gap metric [Zames, El-Sakkary]: compares the graphs associated with linear systems thought of as input-output maps
- Cepstrum norm [Martin]
- Subspace angles between the column spaces of the observability matrices
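For example, a subspace-angles distance in the spirit of the Martin / De Cock-De Moor construction can be sketched as follows, assuming scipy.linalg.subspace_angles is available; the observability depth and the exact combination of the angles are illustrative choices, not the precise definition used in the literature.

```python
import numpy as np
from scipy.linalg import subspace_angles   # SciPy >= 1.0

def observability(A, C, depth=10):
    """Extended observability matrix [C; CA; CA^2; ...] of a model (A, C)."""
    blocks, M = [], C
    for _ in range(depth):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def subspace_distance(A1, C1, A2, C2, depth=10):
    th = subspace_angles(observability(A1, C1, depth),
                         observability(A2, C2, depth))
    # Distance from the principal angles; 0 for identical column spaces.
    return np.sqrt(-np.log(np.prod(np.cos(th) ** 2)))
```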
Learning metrics from a training set
- All those metrics are task-specific; besides, it makes no sense to choose a single distance for all possible classification problems, as labels can be assigned arbitrarily to dynamical systems, no matter what the underlying structure is
- When some a-priori information is available (a training set)... we can learn in a supervised fashion the "best" metric for the classification problem!
- A feasible approach: volume minimization of pullback metrics
Learning distances
- Of course, many unsupervised algorithms take an input dataset and embed it in some other space, implicitly learning a metric (LLE, Laplacian eigenmaps, etc.); however, they fail to learn a full metric for the whole input space, learning only the images of a set of samples
- [Xing, Jordan]: maximizes classification performance over linear maps y = A^{1/2} x -> optimal Mahalanobis distance; reduces to convex optimization
- [Shental et al.]: relevant component analysis: changes the feature space by a global linear transformation which assigns large weights to "relevant dimensions" and low weights to irrelevant dimensions
Learning pullback metrics
- Some notions of differential geometry give us a tool to build a parameterized family of metrics
- Consider then a family of diffeomorphisms F between the original space M and a metric space N
- The diffeomorphism F induces on M a family of pullback metrics
- The geodesics of the pullback metric are the liftings of the geodesics associated with the original metric
[Diagram: F maps M to the metric space N, carrying the metric of N back onto M]
Pullback metrics - detail
Diffeomorphism from M to N:

F: M \to N, \quad m \mapsto F(m)

Push-forward map:

F_*: T_m M \to T_{F(m)} N, \quad v \mapsto F_* v

Given a metric g on N, g: TN \times TN \to \mathbb{R}, the pullback metric on M is

(F^* g)_m(u, v) = g_{F(m)}(F_* u, F_* v)
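Numerically, the pullback construction amounts to sandwiching the metric of N between Jacobians of F. A minimal sketch, with finite differences standing in for the exact Jacobian:

```python
import numpy as np

def jacobian(F, m, eps=1e-6):
    """Central-difference Jacobian of F at m; column j = dF/dm_j."""
    m = np.asarray(m, dtype=float)
    cols = [(F(m + eps * e) - F(m - eps * e)) / (2 * eps)
            for e in np.eye(len(m))]
    return np.stack(cols, axis=1)

def pullback_metric(F, g, m):
    """g(x): metric matrix at x in N; returns the pulled-back matrix at m,
    i.e. J_F(m)' g(F(m)) J_F(m)."""
    J = jacobian(F, m)
    return J.T @ g(F(m)) @ J
```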
Inverse volume
The inverse volume of the pullback metric g̃ around the training set {m_k, k = 1..N} is

O(D) = \frac{\sum_{k=1}^{N} \det\big(\tilde g(m_k)\big)^{-1/2}}{\int_M \det\big(\tilde g(m)\big)^{-1/2} \, dm}
Inverse volume maximization
- The natural criterion would be to optimize the classification performance directly
- In a nonlinear setup this is hard to formulate and solve
- It is reasonable to choose a different but related objective function
- Effect: finding the manifold which best interpolates the data (i.e., forcing the geodesics to pass through "crowded" regions)
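A sketch of how this criterion could be scored and optimized, reusing pullback_metric from the sketch above; the grid search and the unnormalized sum (the integral in the denominator is dropped) are simplifications made for this example, not the original optimization scheme.

```python
import numpy as np

def inverse_volume(F, g, samples):
    """Sum of inverse volume elements of the pullback metric at the samples."""
    return sum(1.0 / np.sqrt(np.linalg.det(pullback_metric(F, g, m)))
               for m in samples)

def best_parameter(make_F, g, samples, grid):
    """make_F(p) builds the diffeomorphism F_p; grid is a list of candidate p."""
    return max(grid, key=lambda p: inverse_volume(make_F(p), g, samples))
```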
Space of AR(2) models
- Given an input sequence, we can identify the parameters of the linear model which best describes it
- We chose the class of autoregressive models of order 2, AR(2); its Fisher metric is

g(a_1, a_2) = \frac{1}{(1 + a_2)(1 - a_1 - a_2)(1 + a_1 - a_2)} \begin{pmatrix} 1 - a_2 & a_1 \\ a_1 & 1 - a_2 \end{pmatrix}
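As a quick sanity check, the metric above can be evaluated numerically; this sketch only assumes the formula as reconstructed here, and verifies positive definiteness at one interior point of the stationarity triangle.

```python
import numpy as np

def fisher_ar2(a1, a2):
    """Fisher metric of an AR(2) model at (a1, a2), valid inside the
    stationarity triangle, where all three denominator factors are positive."""
    den = (1 + a2) * (1 - a1 - a2) * (1 + a1 - a2)
    return np.array([[1 - a2, a1], [a1, 1 - a2]]) / den

g = fisher_ar2(0.3, -0.2)
assert np.all(np.linalg.eigvalsh(g) > 0)   # positive definite inside the triangle
```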
Fisher metric on AR(2)
To get a distance: compute the geodesics of the pullback metric on M.
Space of M(1,1,1) models
- Consider instead the class of stable discrete-time linear systems of order 1:

x(k+1) = a\,x(k) + b\,u(k), \qquad y(k) = c\,x(k)

- After choosing the canonical setting c = 1, the transfer function becomes h(z) = b / (z - a)
- Under stability (|a| < 1) and minimality (b ≠ 0) this family forms a manifold with two connected components:

M(1,1,1) = \{(a, b) : |a| < 1, b > 0\} \cup \{(a, b) : |a| < 1, b < 0\}
Fisher tensor: in suitable coordinates (r, θ), with θ = \arctan\big(a / \sqrt{1 - a^2}\big) and r a reparameterization of b, the Fisher tensor takes the diagonal form

g(r, \theta) = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}
Families of diffeomorphisms
We chose two different families of diffeomorphisms, one for each model space:
- For AR(2) systems: a parametric family F_p(m) acting on the model coordinates (m_1, m_2, m_3), with parameter vector p = (λ_1, λ_2, λ_3)
- For M(1,1,1) systems: a parametric family F_p(r, b) acting on the (r, b) coordinates
Classification of scalar models
- Recognition of actions and identities from image sequences; we used the Mobo database
- Scalar feature, AR(2) and M(1,1,1) models
- Compared the performance of all known distances with the pullback Fisher metric
- Built the geodesic distance and used the NN algorithm to classify new sequences
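An end-to-end sketch of this pipeline, with a least-squares AR(2) fit and plain Euclidean distance in parameter space standing in for the learned pullback geodesic distance (both are stand-ins chosen for this example, not the method of the talk):

```python
import numpy as np

def fit_ar2(y):
    """Least-squares AR(2) fit of a scalar sequence y; returns (a1, a2)."""
    Y = y[2:]
    X = np.stack([y[1:-1], y[:-2]], axis=1)
    a, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a

def nn_classify(test_seq, train_seqs, labels, dist=None):
    """1-NN in model space under a pluggable distance function."""
    dist = dist or (lambda u, v: np.linalg.norm(u - v))
    m = fit_ar2(test_seq)
    d = [dist(m, fit_ar2(s)) for s in train_seqs]
    return labels[int(np.argmin(d))]
```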
Results - action
- Action recognition performance, all views considered: second-best distance function
- Action recognition performance, all views considered: pullback Fisher metric
- Action recognition, view 5 only: difference between the classification rates of the pullback metric and the second-best distance
Results – action 2
Recognition performance of the second-best distance (blue) and the optimal pullback metric (red) for increasing size of the training set. Panels: view 1, view 5, view 3, view 6.
Effect of the training set
- The size of the training set obviously affects the recognition rate
- Systems of the class M(1,1,1); increasing size of the training set on the abscissae
- Panels: all views considered; view 2 only
Conclusions
- Movements can be represented as dynamical systems; motion classification then reduces to finding a distance between dynamical models
- Given a training set of such models, we can learn the "best" metric for a given classification problem... and use it to classify new sequences
- Pullback metrics induced by the Fisher-metric structure on linear models are a possible choice
- Designed a family of diffeomorphisms for each model space
- Future: multidimensional observations, better objective function