Discriminative Learning and Visual Interactive Behavior
Learning Techniques in Audio-Visual Information Processing (ICPR Tutorial)
Tony Jebara, Brian Clarkson, Sumit Basu, Alex Pentland (MIT Media Lab)
Outline
- Motivation: Learning Tasks and Paradigms
- Discriminative / Conditional Learning: Maximum Entropy Discrimination
- Latent Variables and Reversing Jensen's Inequality: CEM and the Dual of EM
- Action-Reaction Learning: Behavior Analysis / Synthesis via Time Series Prediction
- Wearable Platforms: Personal Enhanced Reality System; Wearable Interaction Learning
- Conversational Context Learning
Learning Applications for A & V
- Classification
- Regression / Prediction
- Detection / Clustering
- Transduction
- Feature Selection
[Figure: toy scatter plots illustrating each of the five tasks]
Learning Paradigms
New competitors for Maximum Likelihood: Discriminative Learning & SVMs (NIPS, COLT, UAI, ICML)
- Time Series Prediction (Muller)
- Digit Recognition (Vapnik)
- Speech Recognition (Deng)
- Face Gender Classification (Moghaddam)
- Gene-Sequence Classification (Jaakkola)
- Text Classification (Joachims)

1) Generative Approach: probabilistic models, maximum likelihood
2) Discriminative Approach: support vector machines, VC dimension, maximum margin; the task is explicit and a discriminant surface is learned directly
Learning Paradigms
1) ML & PDFs
+ Natural models (HMMs)
+ Priors
+ Missing data
+ Flexible
- Poor performance
- Objective not task-related

2) Discrimination & SVMs
+ Model & data mismatch
+ Support vectors
+ Good generalization
- Linear model
- Kernels
- No priors
- No missing data

Complementary pros & cons. How to combine both? MED...
Maximum Entropy Discrimination
Tony Jebara, Tommi Jaakkola, Marina Meila
Overview & Motivation
Maximum Entropy Discrimination combines probabilistic methods (and their extensions) in a discriminative framework:
- Adds a task-related objective to PDFs (vs. minimum classification error)
- Satisfies the task constraints
- Support vectors -> generalization
- Convex, with no local minima

Feasible MED extensions & applications: latent variables, various priors, missing labels, structure estimation, anomaly detection, feature selection, regression, latent transformations, multi-class classification, the exponential family.
Classification - Regularization Approach
Given:
- training examples: $\{X_t\}_{t=1}^{T}$
- binary (+/- 1) labels: $\{y_t\}_{t=1}^{T}$
- discriminant function: $\mathcal{L}(X;\Theta)$
- a non-increasing margin loss

Minimize the regularization penalty $R(\Theta)$ subject to the classification constraints:
$$y_t\,\mathcal{L}(X_t;\Theta) - \gamma \;\ge\; 0 \qquad \forall t$$

Example: the SVM minimizes $R(\Theta) = \tfrac{1}{2}\|\theta\|^2$ with discriminant $\mathcal{L}(X;\Theta) = \theta^\top X + b$ and decision rule $\hat y = \mathrm{sign}\,\mathcal{L}(X;\Theta)$.

[Figure: two-class scatter data separated by a maximum-margin linear boundary]
Maximum Entropy Discrimination Approach
Many solutions may be valid. Use a coarser description of the solution instead of a single optimum: solve for a distribution $P(\Theta)$ over all good $\Theta$ (instead of one $\Theta^*$).

Find the $P$ that minimizes $KL(P\,\|\,P_0)$ subject to the constraints:
$$\int P(\Theta,\gamma)\,\bigl[\,y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t\,\bigr]\; d\Theta\, d\gamma \;\ge\; 0 \qquad \forall t$$

$P_0$ = prior over models & margins (favors large margins).

Decision rule:
$$\hat y = \mathrm{sign} \int P(\Theta)\,\mathcal{L}(X;\Theta)\; d\Theta$$

Information transfer / projection: information is transferred to the prior after observations. Entropic regularization and margin penalties are on the same scale.

[Figure: the admissible set of constrained distributions and the minimum-KL projection of the prior $P_0$ onto it]
Maximum Entropy Discrimination Solution
The solution is analytic, unique, and sparse, and covers parametric & structural models:
$$P(\Theta,\gamma) \;=\; \frac{1}{Z(\lambda)}\, P_0(\Theta,\gamma)\, \exp\Bigl(\sum\nolimits_t \lambda_t \bigl[\,y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t\,\bigr]\Bigr)$$

- $Z(\lambda)$: normalization constant (partition function)
- $\lambda_t \ge 0$: non-negative Lagrange multipliers
- $\lambda$ solved via the unique maximum of the concave objective $J(\lambda) = -\log Z(\lambda)$

Example (SVM):
$$J(\lambda) = \sum_t \Bigl[\lambda_t + \log\bigl(1 - \lambda_t/c\bigr)\Bigr] \;-\; \tfrac{1}{2}\sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^\top X_{t'}$$

Example (generative models, e-family):
$$\mathcal{L}(X;\Theta) = \log\frac{P(X\,|\,\theta_+)}{P(X\,|\,\theta_-)} + b$$
Use conjugates of the e-family for the prior; $\Theta$ = model parameters and structure, $b$ = bias term.
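To make the SVM instance concrete, here is a minimal numpy sketch (mine, not the tutorial's implementation) of maximizing the concave MED-SVM dual above by projected gradient ascent. It assumes a unit-variance Gaussian prior on the parameters and swaps the usual bias constraint for a broad Gaussian prior on b with variance `sigma2_b`; all names and step sizes are illustrative.

```python
import numpy as np

def med_svm_dual(X, y, c=10.0, sigma2_b=10.0, lr=5e-4, iters=20000):
    """Projected gradient ascent on the concave MED-SVM objective
    J(lam) = sum_t [lam_t + log(1 - lam_t/c)]
             - 0.5 sum_{t,t'} lam_t lam_t' y_t y_t' X_t.X_t'
             - 0.5 * sigma2_b * (sum_t lam_t y_t)^2,
    where the last term comes from the assumed Gaussian prior on the bias b
    (standing in for the usual constraint sum_t lam_t y_t = 0)."""
    K = X @ X.T                                  # linear-kernel Gram matrix
    lam = np.full(len(y), 0.01)
    for _ in range(iters):
        g = (1.0 - 1.0 / (c - lam)
             - y * (K @ (lam * y))
             - sigma2_b * y * np.sum(lam * y))   # gradient of J(lam)
        lam = np.clip(lam + lr * g, 0.0, c - 1e-6)
    w = X.T @ (lam * y)                          # posterior-mean parameters
    b = sigma2_b * np.sum(lam * y)               # posterior-mean bias
    return lam, w, b

# Toy usage: two separable Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
lam, w, b = med_svm_dual(X, y)
print((np.sign(X @ w + b) == y).mean())          # training accuracy
```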
Maximum Entropy Discrimination for Regression
Find the $P$ that minimizes $KL(P\,\|\,P_0)$ subject to the two-sided regression constraints:
$$\int P\,\bigl[\,y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\,\bigr] \;\ge\; 0, \qquad \int P\,\bigl[\,\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\,\bigr] \;\ge\; 0 \qquad \forall t$$

Decision rule:
$$\hat y = \int P(\Theta)\,\mathcal{L}(X;\Theta)\; d\Theta$$

Solution:
$$P(\Theta,\gamma,\gamma') = \frac{1}{Z(\lambda,\lambda')}\, P_0\, \exp\Bigl(\sum_t \lambda_t\bigl[\,y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\,\bigr] + \lambda'_t\bigl[\,\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\,\bigr]\Bigr)$$

Margin priors (epsilon-tube):
$$P_0(\gamma_t) \;\propto\; \begin{cases} e^{\,c\,(\gamma_t - \epsilon)} & \gamma_t \le \epsilon \\ 0 & \text{otherwise} \end{cases}$$

Example (SVM regression):
$$J(\lambda,\lambda') = \sum_t y_t(\lambda_t - \lambda'_t) \;-\; \epsilon\sum_t(\lambda_t + \lambda'_t) \;+\; \sum_t\Bigl[\log\bigl(1-\tfrac{\lambda_t}{c}\bigr) + \log\bigl(1-\tfrac{\lambda'_t}{c}\bigr)\Bigr] \;-\; \tfrac{1}{2}\sum_{t,t'}(\lambda_t-\lambda'_t)(\lambda_{t'}-\lambda'_{t'})\, X_t^\top X_{t'}$$

[Figure: the margin prior and the penalty function it induces]
Classification and Regression Examples

[Figures: generative-model classification with full-covariance Gaussians, maximum likelihood vs. MED; SVM regression of the sinc function (clean and with Gaussian noise) using linear and 3rd-order polynomial kernels]
Maximum Entropy Discrimination - Cancer
Maximum Entropy Discrimination - Crabs

Feature Selection (Extension)
- Isolates interesting dimensions of the data for the given task
- Typically needs exponential search: consider all possible subsets of dimensions
- Reduces the complexity of the data
- Also improves generalization
- Augments sparse vectors (SVMs) with sparse dimensions
- Is possible jointly with parameter estimation
- Can be done discriminatively and efficiently with MED
Feature Selection
Modify the parameters to include a binary ON/OFF switch per feature; the model then contains structural parameters that can aggressively prune features:
$$\mathcal{L}(X;\Theta) = \sum_i s_i\,\theta_i X_i + b, \qquad s_i \in \{0,1\}$$

Prior:
$$P_0(\Theta) = P_0(\theta)\, P_0(b) \prod_i P_0(s_i)$$

Switch prior: a Bernoulli distribution with $P_0(s_i = 1) = p_0$; the $p_0$ parameter smoothly selects between no pruning ($p_0 \to 1$) and aggressive pruning ($p_0 \to 0$).

[Figure: the effective prior on $s_i\,\theta_i$; aggressive attenuation of linear coefficients at low values ($p_0 = 0.01$)]
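As a quick numeric illustration (not from the tutorial) of how the Bernoulli switch prior gates features, the sketch below computes the posterior probability that a switch is on as a function of the aggregated dual coefficient w_i implied by the partition-function term used in the next slide; `switch_posterior` and its arguments are assumed names.

```python
import numpy as np

def switch_posterior(w, p0):
    """P(s_i = 1 | data) implied by the factor  1 - p0 + p0*exp(w_i^2 / 2):
    the ON branch gets weight p0*exp(w^2/2), the OFF branch weight 1 - p0."""
    on = p0 * np.exp(0.5 * w ** 2)
    return on / ((1.0 - p0) + on)

w = np.linspace(0.0, 3.0, 7)
for p0 in (0.99, 0.1, 0.01):
    print(f"p0={p0}:", np.round(switch_posterior(w, p0), 3))
# Small p0 drives the posterior toward 0 unless w is large:
# aggressive attenuation of weak features, as in the figure above.
```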
Feature Selection in SVM Classification & Results
$$J(\lambda) = \sum_t \Bigl[\lambda_t + \log\bigl(1-\lambda_t/c\bigr)\Bigr] \;-\; \sum_i \log\Bigl[\,1 - p_0 + p_0\, e^{\,w_i^2/2}\,\Bigr], \qquad w_i = \sum_t \lambda_t\, y_t\, X_{t,i}$$

$\lambda$ is constrained to the $[0,c]$ hypercube with the constraint $\sum_t \lambda_t y_t = 0$.

DNA data: 2-class, 100-element binary vectors (the original 25 positions x G/A/T/C). Training set 500, testing 4,724.

[Figures: ROC of the DNA splice-site task with 100 features; ROC with ~5,000 features under a quadratic kernel; CDF of the linear coefficients with 100 features. Dashed line: p0 = 0.99999; solid line: p0 = 0.00001]
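A small sketch (assumed helper name) of evaluating the feature-selection dual above, usable as the objective handed to any concave maximizer subject to the stated box and bias constraints:

```python
import numpy as np

def med_fs_objective(lam, X, y, c, p0):
    """J(lam) = sum_t [lam_t + log(1 - lam_t/c)]
              - sum_i log[1 - p0 + p0*exp(w_i^2/2)],  w_i = sum_t lam_t y_t X_ti.
    Maximize over lam in the [0, c] hypercube with sum_t lam_t y_t = 0."""
    w = X.T @ (lam * y)
    barrier = lam + np.log1p(-lam / c)
    feature = np.log((1.0 - p0) + p0 * np.exp(0.5 * w ** 2))
    return barrier.sum() - feature.sum()
```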
Feature Selection in SVM Regression & Results
$$J(\lambda,\lambda') = \sum_t y_t(\lambda_t-\lambda'_t) \;-\; \epsilon\sum_t(\lambda_t+\lambda'_t) \;+\; \sum_t\Bigl[\log\bigl(1-\tfrac{\lambda_t}{c}\bigr)+\log\bigl(1-\tfrac{\lambda'_t}{c}\bigr)\Bigr] \;-\; \sum_i \log\Bigl[\,1-p_0+p_0\, e^{\,w_i^2/2}\,\Bigr], \qquad w_i = \sum_t (\lambda_t - \lambda'_t)\, X_{t,i}$$

Boston Housing data: 13 scalar features; training set 481, testing 25. Explicit quadratic kernel expansion used.

Linear Model Estimator      Epsilon-Sensitive Linear Loss
Least-Squares               1.7584
MED p0 = 0.99999            1.7529
MED p0 = 0.1                1.6894
MED p0 = 0.001              1.5377
MED p0 = 0.00001            1.4808

D. Ross Cancer data: 67 scalar features; training set 50, testing 3,951.

Linear Model Estimator      Epsilon-Sensitive Linear Loss
Least-Squares               3.609e+03
MED p0 = 0.00001            1.6734e+03

[Figure: CDF of the linear coefficients. Dashed line: p0 = 0.99999; dotted line: p0 = 0.001; solid line: p0 = 0.00001]
Feature Selection in Generative Models
Feature selection is not limited to SVMs; it applies to discriminative generative-model estimation as well, though tractable computation sometimes needs approximations.

Example: 2-class Gaussian distributions with variable means and identity covariance.

Parameters: means $\{\mu, \nu\}$ with Gaussian priors. Switches: $\{s_i\}$ with Bernoulli priors $P_0(s_i = 1) = p_0$.

Discriminant function (tractable in this case):
$$\mathcal{L}(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} + b = \sum_i s_i\Bigl[(\mu_i - \nu_i)\,X_i - \tfrac{1}{2}\bigl(\mu_i^2 - \nu_i^2\bigr)\Bigr] + b$$
Latent Transformations (Extension)
Each input example has additional unobservable properties (e.g., category, affine transform, latent variable, alignment); we only have a prior distribution over them.

Given:
- training examples: $\{X_t\}_{t=1}^{T}$
- binary (+/- 1) labels: $\{y_t\}_{t=1}^{T}$
- hidden transformations: $\{U_t\}_{t=1}^{T}$
- transformation function: $\tilde X = \tau(X, U)$
- prior on transforms: $P_0(U)$

Solution: transductive and iterative; solve by alternating between $P(\Theta)$ and $P(U)$:
$$P(\Theta,\gamma,U) = \frac{1}{Z(\lambda)}\, P_0(\Theta,\gamma,U)\,\exp\Bigl(\sum_t \lambda_t\bigl[\,y_t\,\mathcal{L}(\tau(X_t,U_t);\Theta) - \gamma_t\,\bigr]\Bigr)$$

Example:
$$\mathcal{L}(X;\Theta) = \log\frac{P(\tau(X,U)\,|\,\theta_+)}{P(\tau(X,U)\,|\,\theta_-)} + b$$
Optimization & Bounded QP (Extension)
MED maximizes a concave objective with convex constraints, so axis-parallel, Newton, and gradient methods will converge.

Alternatively, lower-bound the concave objective with a quadratic; we can then use SMO, QP, and other SVM optimizers.

Example (SVM feature selection): iterate the bound (making tangential contact at the current $\tilde\lambda$) and a QP, seeding each QP at the previous solution; this converges in about 10 fast iterations:
$$J(\lambda) \;\ge\; J(\tilde\lambda) + (\lambda - \tilde\lambda)^\top g \;-\; \tfrac{1}{2}\,(\lambda - \tilde\lambda)^\top H\,(\lambda - \tilde\lambda)$$
where $g$ is the gradient of $J$ at $\tilde\lambda$ and $H$ dominates the curvature of $J$.
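The following schematic is my own minimal rendering of that bound-and-QP loop, under the assumption of a diagonal curvature bound `H_diag` supplied by the caller; for a separable quadratic the box-constrained inner "QP" has the closed form shown.

```python
import numpy as np

def bound_and_qp(grad, H_diag, lam0, lo, hi, outer=10):
    """Iterate a quadratic lower bound on a concave objective J:
        J(lam) >= J(lam~) + g.(lam - lam~) - 0.5 (lam - lam~)' H (lam - lam~),
    valid whenever the (here diagonal) H dominates J's curvature.  Each pass
    maximizes the bound exactly over the box [lo, hi] and seeds the next
    bound at the new point, mirroring the slide's iterate-bound-then-QP loop."""
    lam = lam0.copy()
    for _ in range(outer):
        g = grad(lam)                          # gradient at contact point lam~
        lam = np.clip(lam + g / H_diag, lo, hi)  # closed-form boxed maximizer
    return lam
```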
MED for the Exponential Family (Extension)
Claim: MED with generative models spans members of the exponential family (where Gaussians generate SVMs).

Exponential family form:
$$P(X|\theta) = \exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$$
Conjugate prior:
$$P_0(\theta) \propto \exp\bigl(\chi^\top\theta - \alpha\, K(\theta)\bigr)$$

Analytic partition function for classification: with $\mathcal{L}(X;\Theta) = \log P(X|\theta_+) - \log P(X|\theta_-) + b$ and a non-informative prior on $b$ (which forces $\sum_t \lambda_t y_t = 0$), the partition function factorizes over $\theta_+$, $\theta_-$, and the margins. Each $\theta$ factor stays within the conjugate family:
$$Z_{\theta_\pm}(\lambda) \;\propto\; \int \exp\Bigl(\bigl(\chi \pm \sum\nolimits_t \lambda_t y_t X_t\bigr)^\top \theta \;-\; \bigl(\alpha \pm \sum\nolimits_t \lambda_t y_t\bigr)\, K(\theta)\Bigr)\, d\theta$$
which is the known normalizer of the conjugate prior evaluated at updated hyperparameters, hence analytic.
Concluding Ideas on MED and Feature Selection
Maximum Entropy Discrimination is a flexible Bayesian regularization approach. It provides a geometric view of learning as constrained minimization of divergence to prior distributions over margins, parameters, and latent variables. It simultaneously combines:
- probabilistic methods
- large-margin discrimination and SVMs
- feature selection
- classification, regression, etc.
- parameter and structure estimation
- exponential-family generative models
- transduction and detection

Feature selection is a particularly advantageous extension which provides increased sparsity (support vectors & support dimensions) and improves generalization.
Limitation of MED
MED applies to the exponential family, yet many models are MIXTURES of the e-family:
- latent models
- HMMs
- mixture models
- incomplete data
- hidden-variable Bayesian networks

Such models are intractable in discriminative & conditional settings. Thus we use variational bounds to perform the calculations: invoke EM and derive its discriminative DUAL by reversing Jensen.
Reversing Jensen's Inequality
The Dual of EM for Discriminative Latent Learning
Tony Jebara, Alex Pentland
Jensen’s Inequality
- Inequalities allow us to integrate, maximize, evaluate, and derive intractable expressions.
- Convexity: 1905-1906 by J. Jensen (Danish mathematician & engineer). See "Convex Functions, Partial Orderings and Statistical Applications" by J. Pecaric, F. Proschan and Y. Tong.

Jensen in statistics and EM:
- Subsumes many information-theoretic bounds (Cover & Thomas)
- Subsumes the EM algorithm (Dempster, Laird & Rubin; Baum-Welch)
- EM casts latent-variable problems as complete data by solving for a lower bound on likelihood

Reversals of Jensen:
- Constrained reversals and converses have been explored and are active areas in mathematics (S.S. Dragomir)
- Reversals have yet to be applied to discriminative learning
The EM Algorithm
[Figure: a function f(x) with successive tangential variational bounds at iterates x1, ..., x4]
EM makes the intractable maximization of likelihood, and the integration of Bayesian inference, tractable via variational bounds.

E-step: replace unknowns with their expected values under the current model (i.e., solve for a lower bound on the likelihood using Jensen!).

M-step: optimize the current model with the complete data (maximizing the Jensen lower bound!).

EM applies and converges for exponential-family mixtures, a very large space of models that covers most of contemporary machine learning: HMMs, Gaussian mixture models, etc.
The Exponential Family
$$P(X|\theta) = \exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$$

Mixtures of the e-family:
$$P(X|\Theta) = \sum_m P(m)\, P(X|\theta_m) = \sum_m \alpha_m \exp\bigl(A(X) + X^\top\theta_m - K(\theta_m)\bigr)$$

E-Distribution   A(X)                          K(theta)
Gaussian         -X'X/2 - (D/2) log(2 pi)      theta'theta / 2
Multinomial      0                             log(1 + sum_d exp(theta_d))
Exponential      0                             -log(-theta)

E-family: $K(\theta)$ is a convex function and $X$ lies in its gradient space. Special properties: conjugates, linearity, convexity, etc. Includes Poisson, Gaussian, trees, multinomial, exponential: TRACTABLE.

Mixtures of the e-family: Gaussian mixture models, HMMs, sigmoidal belief nets, latent Bayes nets, etc. (most machine learning models!): INTRACTABLE.
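A small numeric check (mine, not the tutorial's) that the table's A and K pairs normalize correctly: plugging each into exp(A(x) + x*theta - K(theta)) and integrating on a grid should give 1 (the discrete multinomial case is omitted).

```python
import numpy as np

families = {
    # 1-D members of exp(A(x) + x*theta - K(theta))
    "gaussian":    dict(A=lambda x: -0.5 * x * x - 0.5 * np.log(2 * np.pi),
                        K=lambda th: 0.5 * th * th,
                        theta=0.3, grid=np.linspace(-15, 15, 200001)),
    "exponential": dict(A=lambda x: np.zeros_like(x),
                        K=lambda th: -np.log(-th),      # requires theta < 0
                        theta=-0.7, grid=np.linspace(0, 200, 200001)),
}

for name, f in families.items():
    x = f["grid"]
    p = np.exp(f["A"](x) + x * f["theta"] - f["K"](f["theta"]))
    dx = x[1] - x[0]
    print(name, (p.sum() - 0.5 * (p[0] + p[-1])) * dx)  # trapezoid rule ~ 1.0
```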
Conditional & Discriminative Classification
[Figure: data actually drawn from four white Gaussians, modeled with two white Gaussians per class. The problematic maximum-likelihood fit gives l = -8.0, l_c = -1.7; the conditional maximum-likelihood solution for p(y|x) gives l = -54.7, l_c = +0.4]
Conditional & Discriminative Regression
[Figure: regression fits; maximum likelihood gives l = -4.2, l_c = -2.4, while the conditional solution gives l = -5.2, l_c = -1.8]
Discriminative Criteria and Negated Log-Sums
Maximum likelihood (clusters towards all the data):
$$l = \sum_i \log P(c_i, X_i\,|\,\Theta) = \sum_i \log \sum_m P(m, c_i, X_i\,|\,\Theta)$$

Maximum conditional likelihood (emphasizes the classification task; repels models away from incorrect data):
$$l_c = \sum_i \log P(c_i\,|\,X_i,\Theta) = \sum_i \log \sum_m P(m, c_i, X_i\,|\,\Theta) \;-\; \sum_i \log \sum_{m,c} P(m, c, X_i\,|\,\Theta)$$

Likewise, a latent log-likelihood ratio:
$$\mathcal{L}(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} = \log\sum_m P(m, X|\theta_+) \;-\; \log\sum_m P(m, X|\theta_-)$$

Maximum margin discrimination (MED) emphasizes sparsity and the discriminant boundary for the task.

The negated log-sums above make integration, maximization, etc. intractable. We need to simplify via overall lower and upper bounds...
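To make the negated log-sum concrete, here is a small sketch (assumed function names; identity-covariance Gaussians) computing l and l_c for a two-class latent mixture, matching the expressions above:

```python
import numpy as np

def log_gauss(X, mu):
    """log N(x; mu, I) for X:(N,D) against component means mu:(M,D) -> (N,M)."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return -0.5 * d2 - 0.5 * X.shape[1] * np.log(2.0 * np.pi)

def likelihoods(X, c, mus, logw):
    """l  = sum_i log sum_m p(m, c_i, X_i)          (joint: EM's objective)
       lc = l - sum_i log sum_{m,c} p(m, c, X_i)    (conditional: a NEGATED
    log-sum appears, which Jensen bounds from the wrong side).
    mus[k]: (M,D) means for class k; logw[k]: (M,) log of p(c=k, m)."""
    log_joint = np.stack(
        [np.logaddexp.reduce(log_gauss(X, mus[k]) + logw[k], axis=1)
         for k in (0, 1)], axis=1)                  # (N,2) = log p(c, X_i)
    l = log_joint[np.arange(len(c)), c].sum()
    lc = l - np.logaddexp(log_joint[:, 0], log_joint[:, 1]).sum()
    return l, lc
```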
Jensen:
- Uses the concavity of log()
- Reweights data with responsibilities
- The variational lower bound at the current model $\tilde\Theta$ makes tangential contact with the true objective (the log-sum)
- Local computations yield a global lower bound: $f(E\{h\}) \ge E\{f(h)\}$

DANGER: on a NEGATED log-sum this gives an upper bound instead!
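A three-line numeric check (mine) of the contact property: the Jensen bound on a log-sum equals the log-sum exactly when the responsibilities q are the normalized mixture weights, and dips below for any other q.

```python
import numpy as np

p = np.array([0.05, 0.30, 0.10])     # unnormalized component likelihoods
q = p / p.sum()                      # responsibilities at the current model
print(np.log(p.sum()))               # true log-sum:              ~ -0.799
print(np.sum(q * np.log(p / q)))     # Jensen bound at optimal q: identical
u = np.full(3, 1 / 3)
print(np.sum(u * np.log(p / u)))     # any other q: strictly smaller
```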
Reverse-Jensen:
- Uses the convexity of the e-family
- Reweights AND translates the data
- The variational upper bound at the current model $\tilde\Theta$ makes tangential contact with the true objective (the log-sum), or a tighter variant
- Local computations yield a global upper bound: $f(E\{h\}) \le E\{f(h)\}$
Reverse-Jensen Bounds
[Figure: Gaussian and multinomial cases; log-sum (gray), Jensen bound (black), reverse-Jensen bound (white). Short proof idea: map bowl to bowl]
CEM Regression Results
Estimate a p(y|x) regression model with 2 Gaussians (Gaussian gates with linear experts). CEM monotonically increases conditional likelihood, unlike EM. Result: a better p(y|x) which captures the multimodality in y without wasting resources in x.

Algorithm                        Regression Accuracy
Cascade-Correlation (0 hidden)   24.86%
Cascade-Correlation (5 hidden)   26.25%
C4.5                             21.5%
Linear Discriminant               0.0%
k=5 Nearest Neighbor              3.57%
EM, 2 Gaussians                  22.32%
EM & CEM, 1 Gaussian             20.79%
CEM, 2 Gaussians                 26.63%

[Plots: the CEM fit and its conditional log-likelihood rising monotonically over iterations]
CEM Regression Results
Abalone age data set (UCI).

[Plots: the EM fit on the same data and its non-monotonic conditional log-likelihood over iterations]
CEM Classification Results
The Gaussian mixture model shown earlier for classification: monotonic convergence, at roughly double the per-epoch computation of EM. CEM accuracy = 93%, EM accuracy = 59%.

Multinomial mixture model: 3-class multinomials for 60 base-pair protein chains. CEM monotonically increases conditional likelihood.
[Diagram: joint model Theta -> (X, Y) vs. conditional model Theta' with X -> Y]

- Each assumes different independencies in the data = data structure (like model structure, which is useful for learning)
- Exponential number of conditional models:
  -> use a handful for frequent tasks and the joint for rare tasks
  -> use the marginal for unreliable covariates
Bayesian Inference: Conditional vs. Joint
Notation: x = covariate, y = response, X = covariate dataset, Y = response dataset.

Joint approach:
$$p(x,y) = \int p(x,y,\Theta\,|\,X,Y)\, d\Theta = \int p(x,y\,|\,\Theta)\, p(\Theta\,|\,X,Y)\, d\Theta$$
$$p(x,y) \approx p(x,y\,|\,\Theta^*), \qquad \Theta^* = \begin{cases}\arg\max_\Theta\, p(X,Y|\Theta)\,p(\Theta) & \text{(MAP)}\\ \arg\max_\Theta\, p(X,Y|\Theta) & \text{(ML)}\end{cases}$$

Conditioned joint approach:
$$p(y|x) = \frac{\int p(x,y\,|\,\Theta)\,p(\Theta\,|\,X,Y)\, d\Theta}{\int p(x\,|\,\Theta)\,p(\Theta\,|\,X,Y)\, d\Theta}$$

Conditional approach (a model $\Theta'$ of $p(y|x)$ only):
$$p(y|x) = \int p(y\,|\,x,\Theta')\,p(\Theta'\,|\,X,Y)\, d\Theta'$$
$$p(\Theta'\,|\,X,Y) = \frac{p(Y|X,\Theta')\,p(X|\Theta')\,p(\Theta')}{p(X,Y)} = \frac{p(Y|X,\Theta')\,p(X)\,p(\Theta')}{p(X,Y)}$$
(the conditional model places no distribution on X, so $p(X|\Theta') = p(X)$), giving
$$p(y|x) \propto \int p(y\,|\,x,\Theta')\,p(Y\,|\,X,\Theta')\,p(\Theta')\, d\Theta'$$
$$p(y|x) \approx p(y\,|\,x,\Theta'^*), \qquad \Theta'^* = \begin{cases}\arg\max_{\Theta'}\, p(Y|X,\Theta')\,p(\Theta') & \text{(MAP)}\\ \arg\max_{\Theta'}\, p(Y|X,\Theta') & \text{(ML)}\end{cases}$$

Of course, full Bayesian integration is impractical, so we must consider conditional MAP and ML...
An Exact Bayesian Example

Data: p(y|x) from 4 points, to fit with 2 Gaussians.

[Figure: the conditional density estimate vs. the conditioned joint density estimate over the (X, Y) plane]
Extensions
- Annealing: the bound can accommodate a Gibbs temperature for a global optimum, e.g. raising the e-family term $\exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$ to a power $\beta$
- Latent Bayesian networks
- Hidden Markov models
- MED: large-margin latent discrimination
- Variational Bayesian inference
- Generic optimization

Check http://www.media.mit.edu/~jebara
Action-Reaction Learning
Automatic Visual Analysis and Synthesis of Interactive Behaviour
Tony Jebara, Alex Pentland
Motivation & Background

IDEA: computer vision, Face & Gesture at Killington, behaviourists...

BEHAVIOUR PERCEPTION: static imagery -> simple temporal models -> learned temporal dynamics, higher-order control, multiple hypotheses, HMMs, NNs (Blake, Bregler, Pentland, Bobick, Hogg)

BEHAVIOUR SYNTHESIS: competing behaviours, control, reinforcement, ethology, cog-sci (Brooks, Terzopolous, Blumberg, Uchibe, Mataric, Large)

ARL (ACTION-REACTION LEARNING):
- Machine learning of the correlations between stimulus & response via perceptual measurements of human interactions
- Imitation-based learning (Mataric)
- Behaviourism (Thorndike, Watson, Skinnerian, Gibsonian) -> reactionary
- Watch humans interacting to learn how to react to a stimulus
Scenario
- Automatic unsupervised observation of a 2-agent interaction
- Track lip motions
- Discover correlations between a past action & the consequent reaction
- Estimate p(y|x)
System Architecture
OFFLINE: learning from human interaction, spying on two users to learn p(y|x).
ONLINE: interaction with a single user via the learned p(y|x).
Perception: probabilistic head & hand tracking via Expectation Maximization (EM).

Responsibilities for color (RGB) pixels and spatial (xy) blob membership:
$$p(i\,|\,\mathbf{x}_{rgb}) \propto \frac{p_i\, \exp\bigl(-\tfrac{1}{2}(\mathbf{x}_{rgb}-\mu_i)^\top \Sigma_i^{-1} (\mathbf{x}_{rgb}-\mu_i)\bigr)}{(2\pi)^{3/2}\,|\Sigma_i|^{1/2}}, \qquad p(j\,|\,\mathbf{x}_{xy}) \propto \frac{p_j\, \exp\bigl(-\tfrac{1}{2}(\mathbf{x}_{xy}-\mu_j)^\top \Sigma_j^{-1} (\mathbf{x}_{xy}-\mu_j)\bigr)}{2\pi\,|\Sigma_j|^{1/2}}$$
each normalized by the corresponding sum over components.
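A compact numpy version (assumed names) of the responsibility computation above; the same routine serves both the RGB color Gaussians and the spatial xy blobs.

```python
import numpy as np

def responsibilities(x, weights, means, covs):
    """E-step: p(i | x) = pi_i N(x; mu_i, S_i) / sum_j pi_j N(x; mu_j, S_j),
    computed in the log domain for numerical stability."""
    logp = np.array([
        np.log(w) - 0.5 * (x - m) @ np.linalg.solve(S, x - m)
        - 0.5 * np.linalg.slogdet(2.0 * np.pi * S)[1]
        for w, m, S in zip(weights, means, covs)])
    r = np.exp(logp - logp.max())
    return r / r.sum()
```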
Perception: temporal representation. 5 parameters per blob = 2 centroid + 3 square-root covariance: $(\mu_x,\ \mu_y,\ \Sigma_{xx}^{1/2},\ \Sigma_{xy}^{1/2},\ \Sigma_{yy}^{1/2})$.

Graphical output (seen by both users).
Temporal Modeling: time series prediction (Gershenfeld, Weigend, Mozer, Wan). Santa Fe competition: NNs, RNNs, HMMs, differential equations, etc., on sunspot, Bach, physiological, and chaotic-laser series.

Short-term memory pre-processing (Wan, Elliot-Anderson):
$$y(t) \approx g\bigl(y(t-1),\ y(t-2),\ \dots,\ y(t-T)\bigr)$$
$$x(t) = \bigl[\, y(t-1)\ \ y(t-2)\ \cdots\ y(t-T) \,\bigr]$$

y: features (blobs concatenated); g: prediction mapping; short-term memory with T = 120.
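A sketch (assumed helper name) of the short-term-memory pre-processing: stack the previous T feature frames into one input vector per time step.

```python
import numpy as np

def short_term_memory(Y, T=120):
    """Y: (N, D) time series of concatenated blob features.
    Returns inputs x(t) = [y(t-1) ... y(t-T)] flattened, and targets y(t)."""
    N = len(Y)
    X = np.stack([Y[t - T:t][::-1].ravel() for t in range(T, N)])
    return X, Y[T:]
```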
Temporal Modeling: principal components analysis (linear; alternatives include FFT, wavelets, ...).

Gaussian distribution of the STM (roughly 6.5 seconds). Dims = T x Feats = 120 x 30 -> 40 (95% of the energy of the submanifold): a low-dimensional (i.e., smoothed) characterization of the past interaction.

[Figure: eigenvalues, eigenvectors, eigenspace]

Learn the mapping probabilistically: p(y|x) = p(future | STM), versus a deterministic y = g(x).
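A minimal PCA compression of the STM vectors (my sketch; the 95% energy threshold follows the slide):

```python
import numpy as np

def pca_compress(X, energy=0.95):
    """Keep the leading principal components capturing `energy` of the
    variance (120 x 30 = 3600 dims -> ~40 on the slide's data)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), energy)) + 1
    return Xc @ Vt[:k].T, Vt[:k]       # low-dim coefficients and basis
```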
Learning & The CEM Algorithm

EXPECTATION MAXIMIZATION (EM): learns p(x,y) by maximizing $\prod_{i=1}^{N} p(x_i, y_i\,|\,\Theta)$, a joint model of the phenomenon. Powerful convergence and a clean statistical framework; more global than gradient solutions; can be deterministically annealed.

For conditional problems (input/output regression, classification), joint models are outperformed (e.g., NNs and RNNs versus HMMs) since they don't optimize output error (i.e., the testing criterion is unlike the training criterion).

CONDITIONAL EXPECTATION MAXIMIZATION (CEM), NIPS 11, 1998: learns p(y|x) by maximizing $\prod_{i=1}^{N} p(y_i\,|\,x_i,\Theta)$, a conditional model of the task, with convergence properties like EM but for conditional likelihood.
Variational bound maximization: monotonic, skips local minima, deterministically annealable.

Model: a mixture of experts (soft piece-wise linear): Gaussian gates with conditioned Gaussian regressors. CEM applies to other models too: HMMs, multinomial mixtures, etc.
$$p(y\,|\,x,\Theta) = \frac{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\mu_m,\Sigma_m)\, \mathcal{N}\bigl(y;\hat y_m(x),\Omega_m\bigr)}{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\mu_m,\Sigma_m)}$$

Output: expectation or arg max (regression):
$$\hat y = \int y\, p(y\,|\,\hat x)\, dy = \sum_m \frac{\alpha_m\,\mathcal{N}(\hat x;\mu_m,\Sigma_m)}{\sum_{m'} \alpha_{m'}\,\mathcal{N}(\hat x;\mu_{m'},\Sigma_{m'})}\; \hat y_m(\hat x), \qquad \hat y_m(\hat x) = \mu_m^y + \Sigma_m^{yx}\bigl(\Sigma_m^{xx}\bigr)^{-1}\bigl(\hat x - \mu_m^x\bigr)$$

Computationally very efficient: simple matrix operations.

[Figure: bound maximization on f(x); EM vs. CEM]
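Putting the expectation output above into code (assumed names): gate each expert by its Gaussian responsibility in x, then blend the conditioned-Gaussian means.

```python
import numpy as np

def predict(x, alphas, mu_x, mu_y, S_xx, S_yx):
    """yhat(x) = sum_m gate_m(x) * [mu_y_m + S_yx_m S_xx_m^{-1} (x - mu_x_m)],
    with gate_m(x) proportional to alpha_m N(x; mu_x_m, S_xx_m)."""
    logg = np.array([
        np.log(a) - 0.5 * (x - m) @ np.linalg.solve(S, x - m)
        - 0.5 * np.linalg.slogdet(2.0 * np.pi * S)[1]
        for a, m, S in zip(alphas, mu_x, S_xx)])
    g = np.exp(logg - np.logaddexp.reduce(logg))     # normalized gates
    ys = [my + R @ np.linalg.solve(S, x - mx)
          for mx, my, S, R in zip(mu_x, mu_y, S_xx, S_yx)]
    return sum(gi * yi for gi, yi in zip(g, ys))
```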
[Diagrams: time-series graphical models linking x(t), y(t), and y(t-1) for agents A and B]
TRAINING MODE: the system accumulates action/reaction pairs (x, y) and uses CEM to optimize a conditioned Gaussian mixture model for p(y|x).

INTERACTIVE MODE: the system completes the missing time series and synthesizes the reaction in graphical form for the user, using the learned p(y|x).

Integration, continued:

FEEDBACK MODE: the predicted measurement on the user can be fed back, as a non-linear learned EKF, to aid vision. Can also use p(x) to filter vision and find correspondence.

SIMULATION MODE: fully synthesize both components of the time series (user A and user B). Some instabilities/bugs (no grounding) -> future work.
[Diagrams: feedback and simulation configurations of the A/B time-series model]
Training & Results

Training data: 5 minutes at 15 Hz, approx. 5 repetitions of each gesture. Model: T = 120, dims = 22, M = 25 (to maintain real time). Convergence: 2 hours.

Prediction error:
- Nearest Neighbor: 1.57% RMS
- Constant Velocity: 0.85% RMS
- ARL: 0.64% RMS
Results: [video clips of the SCARE, WAVE, and CLAP interactions]
Alternate Perceptual Modalities
[Plot: rotation quaternion (q0, q1, q2, q3) tracked over time]

FACE MODELING:
- 3D translation, 3D pose, focal length: 7 DOFs
- 3D eigenmodel deformations, texture: 40 DOFs
- Real-time (... + speech + ...)
Conclusions
1 - Unsupervised Discovery of Simple Interactive Behaviour by Perceptual Observations and Statistical Time Series Prediction of Future Given Past or Reaction Given Action
2 - Imitation Based Learning of Behaviour
3 - Real-Time Behaviour Acquisition and Interactive Synthesis
4 - Small Amounts of Training Data and Non-Intrusive Training
5 - Non-Linear Predictive Model for Feedback Perception
6 - Monotonically Convergent Maximum Conditional Likelihood i.e. Discriminant Probabilistic Learning (CEM)
7 - No A Priori Segmentation or Classification of Gestures
Wearable Platform: Dynamic Personal Enhanced Reality System
Tony Jebara, Bernt Schiele, Nuria Oliver, Alex Pentland
DyPERS Architecture
- 3-button interface: record, associate, and discard
- The user records live A/V clips with the wearable
- Associates them with a visual trigger object(s)
- The audio-video is replayed when computer vision sees the trigger object

DyPERS Visual Recognition
- Multidimensional filter-response histograms from differences of Gaussian linear convolutions (magnitude of the 1st derivative and the Laplace operator)
- Compute the probability of an object from k i.i.d. measurements
- Kalman filter on the object probabilities to smooth the classifier
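A rough sketch of one plausible filter-response histogram (using scipy.ndimage for the Gaussian-derivative convolutions; the sigma and bin count are assumptions, not DyPERS's actual settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def filter_histogram(img, sigma=2.0, bins=16):
    """Joint histogram of two filter responses of a grayscale image:
    gradient magnitude (from Gaussian first derivatives) and the Laplacian."""
    gx = gaussian_filter(img, sigma, order=(0, 1))
    gy = gaussian_filter(img, sigma, order=(1, 0))
    mag = np.hypot(gx, gy)
    lap = gaussian_laplace(img, sigma)
    h, _, _ = np.histogram2d(mag.ravel(), lap.ravel(), bins=bins, density=True)
    return h
```

Per-object probabilities from the k i.i.d. measurements would then multiply (sum in the log domain) before the Kalman smoothing step.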
Video
Wearable Platform: Interactive Behavior Acquisition
Tony Jebara, Alex Pentland

Hardware:
- Sony Picturebook laptop
- 2 cameras
- 2 microphones

Action-Reaction Learning:
- Audio-visual processing
- Time series prediction
- Discriminative learning
- DyPERS associative memory

Wearable long-term behavior acquisition.
Tracking Conversational Context (Audio)
Tony Jebara, Yuri Ivanov, Ali Rahimi, Alex Pentland

System Architecture: tracking conversational context for machine mediation; speech recognition on multiple speakers with real-time topic-spotting.
- Bag-of-words (multinomials)
- Short-term memory with decay
- Probability of topic
- A coarse descriptor of mood, topic, etc., used to select prompts to stimulate conversation (see the sketch below)
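A toy rendering (all names assumed) of topic tracking with a decayed bag-of-words: accumulate per-topic multinomial log-likelihoods of incoming words with exponential forgetting, then normalize.

```python
import numpy as np

def topic_posterior(word_ids, log_topic_word, decay=0.98):
    """log_topic_word: (n_topics, vocab) log multinomial parameters (assumed).
    Older words are down-weighted by `decay` per step (short-term memory)."""
    score = np.zeros(log_topic_word.shape[0])
    for w in word_ids:                   # words in temporal order
        score = decay * score + log_topic_word[:, w]
    score -= score.max()                 # stabilize before exponentiating
    p = np.exp(score)
    return p / p.sum()
```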
Conversation Trace
Clustering & MED: a transductive approach for the few-labels case.
Text data from 12 Newsgroups
Wearable Platform
Real-time probabilities of topics (politics, health, religion, etc.). Could also detect emotions & situations...
References
1) Tony Jebara and Alex Pentland. On Reversing Jensen's Inequality. In Neural Information Processing Systems 13 (NIPS '00), Dec. 2000.
2) Tony Jebara, Yuri Ivanov, Ali Rahimi and Alex Pentland. Tracking Conversational Context for Machine Mediation of Human Discourse. In AAAI Fall 2000 Symposium - Socially Intelligent Agents - The Human in the Loop, Nov. 2000.
3) Tony Jebara and Tommi Jaakkola. Feature Selection and Dualities in Maximum Entropy Discrimination. In 16th Conf. on Uncertainty in Artificial Intelligence (UAI 2000).
4) Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum Entropy Discrimination. In Neural Information Processing Systems 12 (NIPS '99).
5) Tony Jebara and Alex Pentland. Action Reaction Learning: Automatic Visual Analysis and Synthesis of Interactive Behaviour. In 1st Intl. Conf. on Computer Vision Systems (ICVS '99).
6) Bernt Schiele, Nuria Oliver, Tony Jebara and Alex Pentland. An Interactive Computer Vision System, DyPERS: Dynamic Personal Enhanced Reality System. In 1st Intl. Conf. on Computer Vision Systems (ICVS '99).
7) Tony Jebara and Alex Pentland. Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm. In Neural Information Processing Systems 11 (NIPS '98).
8) Tony Jebara. Action-Reaction Learning: Analysis and Synthesis of Human Behaviour. Master's Thesis, MIT, May 1998.