Discriminative Learning and Visual Interactive Behavior
Learning Techniques in Audio-Visual Information Processing (ICPR Tutorial)
Tony Jebara, Brian Clarkson, Sumit Basu, Alex Pentland (MIT Media Lab)
Outline
- Motivation: Learning Tasks and Paradigms
- Discriminative / Conditional Learning: Maximum Entropy Discrimination
- Latent Variables and Reversing Jensen's Inequality: CEM and the Dual of EM
- Action-Reaction Learning: Behavior Analysis / Synthesis via Time Series Prediction
- Wearable Platforms: Personal Enhanced Reality System; Wearable Interaction Learning
- Conversational Context Learning
Learning Applications for A & V
- Classification
- Regression / Prediction
- Detection / Clustering
- Transduction
- Feature Selection
[Figure: toy scatter plots illustrating each of the five tasks]
Learning Paradigms
New competitors for Maximum Likelihood: Discriminative Learning & SVMs (NIPS, COLT, UAI, ICML)
- Time Series Prediction (Muller)
- Digit Recognition (Vapnik)
- Speech Recognition (Deng)
- Face Gender Classification (Moghaddam)
- Gene-Sequence Classification (Jaakkola)
- Text Classification (Joachims)

1) Generative Approach: probabilistic models, maximum likelihood
2) Discriminative Approach: support vector machines, VC dimension, maximum margin; the task is explicit and a discriminant surface is learned directly
Learning Paradigms
1) ML & PDFs
+ Natural models (HMMs)
+ Priors
+ Missing data
+ Flexible
- Poor performance
- Objective not task-related

2) Discrimination & SVMs
+ Model & data mismatch
+ Support vectors
+ Good generalization
- Linear model
- Kernels
- No priors
- No missing data

Complementary pros & cons. How to combine both? MED...
Maximum Entropy Discrimination
Tony Jebara, Tommi Jaakkola, Marina Meila
Overview & Motivation
Maximum Entropy Discrimination combines probabilistic methods (and their extensions) in a discriminative framework:
- Adds a task-related objective to PDFs (vs. minimum classification error)
- Satisfies the task constraints
- Support vectors -> generalization
- Convex, with no local minima

Feasible MED extensions & applications: latent variables, various priors, missing labels, structure estimation, anomaly detection, feature selection, regression, latent transformations, multi-class classification, the exponential family.
Classification - Regularization Approach
Given:
- training examples: $\{X_t\}_{t=1}^{T}$
- binary (+/- 1) labels: $\{y_t\}_{t=1}^{T}$
- discriminant function: $\mathcal{L}(X;\Theta)$
- a non-increasing margin loss

Minimize the regularization penalty $R(\Theta)$ subject to the classification constraints:
$$y_t\,\mathcal{L}(X_t;\Theta) - \gamma \;\ge\; 0 \qquad \forall t$$

Example: the SVM minimizes $R(\Theta) = \tfrac{1}{2}\|\theta\|^2$ with discriminant $\mathcal{L}(X;\Theta) = \theta^\top X + b$ and decision rule $\hat y = \mathrm{sign}\,\mathcal{L}(X;\Theta)$.

[Figure: two-class scatter data separated by a maximum-margin linear boundary]
Maximum Entropy Discrimination Approach
Many solutions may be valid. Use a coarser description of the solution instead of a single optimum: solve for a distribution $P(\Theta)$ over all good $\Theta$ (instead of one $\Theta^*$).

Find the $P$ that minimizes $KL(P\,\|\,P_0)$ subject to the constraints:
$$\int P(\Theta,\gamma)\,\bigl[\,y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t\,\bigr]\; d\Theta\, d\gamma \;\ge\; 0 \qquad \forall t$$

$P_0$ = prior over models & margins (favors large margins).

Decision rule:
$$\hat y = \mathrm{sign} \int P(\Theta)\,\mathcal{L}(X;\Theta)\; d\Theta$$

Information transfer / projection: information is transferred to the prior after observations. Entropic regularization and margin penalties are on the same scale.

[Figure: the admissible set of constrained distributions and the minimum-KL projection of the prior $P_0$ onto it]
Maximum Entropy Discrimination Solution
The solution is analytic, unique, and sparse, and covers parametric & structural models:
$$P(\Theta,\gamma) \;=\; \frac{1}{Z(\lambda)}\, P_0(\Theta,\gamma)\, \exp\Bigl(\sum\nolimits_t \lambda_t \bigl[\,y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t\,\bigr]\Bigr)$$

- $Z(\lambda)$: normalization constant (partition function)
- $\lambda_t \ge 0$: non-negative Lagrange multipliers
- $\lambda$ solved via the unique maximum of the concave objective $J(\lambda) = -\log Z(\lambda)$

Example (SVM):
$$J(\lambda) = \sum_t \Bigl[\lambda_t + \log\bigl(1 - \lambda_t/c\bigr)\Bigr] \;-\; \tfrac{1}{2}\sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^\top X_{t'}$$

Example (generative models, e-family):
$$\mathcal{L}(X;\Theta) = \log\frac{P(X\,|\,\theta_+)}{P(X\,|\,\theta_-)} + b$$
Use conjugates of the e-family for the prior; $\Theta$ = model parameters and structure, $b$ = bias term.
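To make the SVM instance concrete, here is a minimal numpy sketch (mine, not the tutorial's implementation) of maximizing the concave MED-SVM dual above by projected gradient ascent. It assumes a unit-variance Gaussian prior on the parameters and swaps the usual bias constraint for a broad Gaussian prior on b with variance `sigma2_b`; all names and step sizes are illustrative.

```python
import numpy as np

def med_svm_dual(X, y, c=10.0, sigma2_b=10.0, lr=5e-4, iters=20000):
    """Projected gradient ascent on the concave MED-SVM objective
    J(lam) = sum_t [lam_t + log(1 - lam_t/c)]
             - 0.5 sum_{t,t'} lam_t lam_t' y_t y_t' X_t.X_t'
             - 0.5 * sigma2_b * (sum_t lam_t y_t)^2,
    where the last term comes from the assumed Gaussian prior on the bias b
    (standing in for the usual constraint sum_t lam_t y_t = 0)."""
    K = X @ X.T                                  # linear-kernel Gram matrix
    lam = np.full(len(y), 0.01)
    for _ in range(iters):
        g = (1.0 - 1.0 / (c - lam)
             - y * (K @ (lam * y))
             - sigma2_b * y * np.sum(lam * y))   # gradient of J(lam)
        lam = np.clip(lam + lr * g, 0.0, c - 1e-6)
    w = X.T @ (lam * y)                          # posterior-mean parameters
    b = sigma2_b * np.sum(lam * y)               # posterior-mean bias
    return lam, w, b

# Toy usage: two separable Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
lam, w, b = med_svm_dual(X, y)
print((np.sign(X @ w + b) == y).mean())          # training accuracy
```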
Maximum Entropy Discrimination for Regression
Find the $P$ that minimizes $KL(P\,\|\,P_0)$ subject to the two-sided regression constraints:
$$\int P\,\bigl[\,y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\,\bigr] \;\ge\; 0, \qquad \int P\,\bigl[\,\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\,\bigr] \;\ge\; 0 \qquad \forall t$$

Decision rule:
$$\hat y = \int P(\Theta)\,\mathcal{L}(X;\Theta)\; d\Theta$$

Solution:
$$P(\Theta,\gamma,\gamma') = \frac{1}{Z(\lambda,\lambda')}\, P_0\, \exp\Bigl(\sum_t \lambda_t\bigl[\,y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\,\bigr] + \lambda'_t\bigl[\,\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\,\bigr]\Bigr)$$

Margin priors (epsilon-tube):
$$P_0(\gamma_t) \;\propto\; \begin{cases} e^{\,c\,(\gamma_t - \epsilon)} & \gamma_t \le \epsilon \\ 0 & \text{otherwise} \end{cases}$$

Example (SVM regression):
$$J(\lambda,\lambda') = \sum_t y_t(\lambda_t - \lambda'_t) \;-\; \epsilon\sum_t(\lambda_t + \lambda'_t) \;+\; \sum_t\Bigl[\log\bigl(1-\tfrac{\lambda_t}{c}\bigr) + \log\bigl(1-\tfrac{\lambda'_t}{c}\bigr)\Bigr] \;-\; \tfrac{1}{2}\sum_{t,t'}(\lambda_t-\lambda'_t)(\lambda_{t'}-\lambda'_{t'})\, X_t^\top X_{t'}$$

[Figure: the margin prior and the penalty function it induces]
Classification and Regression Examples

[Figures: generative-model classification with full-covariance Gaussians, maximum likelihood vs. MED; SVM regression of the sinc function (clean and with Gaussian noise) using linear and 3rd-order polynomial kernels]
Maximum Entropy Discrimination - Cancer
Maximum Entropy Discrimination - Crabs

Feature Selection (Extension)
- Isolates interesting dimensions of the data for the given task
- Typically needs exponential search: consider all possible subsets of dimensions
- Reduces the complexity of the data
- Also improves generalization
- Augments sparse vectors (SVMs) with sparse dimensions
- Is possible jointly with parameter estimation
- Can be done discriminatively and efficiently with MED
Feature Selection
Modify the parameters to include a binary ON/OFF switch per feature; the model then contains structural parameters that can aggressively prune features:
$$\mathcal{L}(X;\Theta) = \sum_i s_i\,\theta_i X_i + b, \qquad s_i \in \{0,1\}$$

Prior:
$$P_0(\Theta) = P_0(\theta)\, P_0(b) \prod_i P_0(s_i)$$

Switch prior: a Bernoulli distribution with $P_0(s_i = 1) = p_0$; the $p_0$ parameter smoothly selects between no pruning ($p_0 \to 1$) and aggressive pruning ($p_0 \to 0$).

[Figure: the effective prior on $s_i\,\theta_i$; aggressive attenuation of linear coefficients at low values ($p_0 = 0.01$)]
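As a quick numeric illustration (not from the tutorial) of how the Bernoulli switch prior gates features, the sketch below computes the posterior probability that a switch is on as a function of the aggregated dual coefficient w_i implied by the partition-function term used in the next slide; `switch_posterior` and its arguments are assumed names.

```python
import numpy as np

def switch_posterior(w, p0):
    """P(s_i = 1 | data) implied by the factor  1 - p0 + p0*exp(w_i^2 / 2):
    the ON branch gets weight p0*exp(w^2/2), the OFF branch weight 1 - p0."""
    on = p0 * np.exp(0.5 * w ** 2)
    return on / ((1.0 - p0) + on)

w = np.linspace(0.0, 3.0, 7)
for p0 in (0.99, 0.1, 0.01):
    print(f"p0={p0}:", np.round(switch_posterior(w, p0), 3))
# Small p0 drives the posterior toward 0 unless w is large:
# aggressive attenuation of weak features, as in the figure above.
```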
Feature Selection in SVM Classification & Results
$$J(\lambda) = \sum_t \Bigl[\lambda_t + \log\bigl(1-\lambda_t/c\bigr)\Bigr] \;-\; \sum_i \log\Bigl[\,1 - p_0 + p_0\, e^{\,w_i^2/2}\,\Bigr], \qquad w_i = \sum_t \lambda_t\, y_t\, X_{t,i}$$

$\lambda$ is constrained to the $[0,c]$ hypercube with the constraint $\sum_t \lambda_t y_t = 0$.

DNA data: 2-class, 100-element binary vectors (the original 25 positions x G/A/T/C). Training set 500, testing 4,724.

[Figures: ROC of the DNA splice-site task with 100 features; ROC with ~5,000 features under a quadratic kernel; CDF of the linear coefficients with 100 features. Dashed line: p0 = 0.99999; solid line: p0 = 0.00001]
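A small sketch (assumed helper name) of evaluating the feature-selection dual above, usable as the objective handed to any concave maximizer subject to the stated box and bias constraints:

```python
import numpy as np

def med_fs_objective(lam, X, y, c, p0):
    """J(lam) = sum_t [lam_t + log(1 - lam_t/c)]
              - sum_i log[1 - p0 + p0*exp(w_i^2/2)],  w_i = sum_t lam_t y_t X_ti.
    Maximize over lam in the [0, c] hypercube with sum_t lam_t y_t = 0."""
    w = X.T @ (lam * y)
    barrier = lam + np.log1p(-lam / c)
    feature = np.log((1.0 - p0) + p0 * np.exp(0.5 * w ** 2))
    return barrier.sum() - feature.sum()
```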
Feature Selection in SVM Regression & Results
$$J(\lambda,\lambda') = \sum_t y_t(\lambda_t-\lambda'_t) \;-\; \epsilon\sum_t(\lambda_t+\lambda'_t) \;+\; \sum_t\Bigl[\log\bigl(1-\tfrac{\lambda_t}{c}\bigr)+\log\bigl(1-\tfrac{\lambda'_t}{c}\bigr)\Bigr] \;-\; \sum_i \log\Bigl[\,1-p_0+p_0\, e^{\,w_i^2/2}\,\Bigr], \qquad w_i = \sum_t (\lambda_t - \lambda'_t)\, X_{t,i}$$

Boston Housing data: 13 scalar features; training set 481, testing 25. Explicit quadratic kernel expansion used.

Linear Model Estimator      Epsilon-Sensitive Linear Loss
Least-Squares               1.7584
MED p0 = 0.99999            1.7529
MED p0 = 0.1                1.6894
MED p0 = 0.001              1.5377
MED p0 = 0.00001            1.4808

D. Ross Cancer data: 67 scalar features; training set 50, testing 3,951.

Linear Model Estimator      Epsilon-Sensitive Linear Loss
Least-Squares               3.609e+03
MED p0 = 0.00001            1.6734e+03

[Figure: CDF of the linear coefficients. Dashed line: p0 = 0.99999; dotted line: p0 = 0.001; solid line: p0 = 0.00001]
Feature Selection in Generative Models
Feature selection is not limited to SVMs; it applies to discriminative generative-model estimation as well, though tractable computation sometimes needs approximations.

Example: 2-class Gaussian distributions with variable means and identity covariance.

Parameters: means $\{\mu, \nu\}$ with Gaussian priors. Switches: $\{s_i\}$ with Bernoulli priors $P_0(s_i = 1) = p_0$.

Discriminant function (tractable in this case):
$$\mathcal{L}(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} + b = \sum_i s_i\Bigl[(\mu_i - \nu_i)\,X_i - \tfrac{1}{2}\bigl(\mu_i^2 - \nu_i^2\bigr)\Bigr] + b$$
Latent Transformations (Extension)
Each input example has additional unobservable properties (e.g., category, affine transform, latent variable, alignment); we only have a prior distribution over them.

Given:
- training examples: $\{X_t\}_{t=1}^{T}$
- binary (+/- 1) labels: $\{y_t\}_{t=1}^{T}$
- hidden transformations: $\{U_t\}_{t=1}^{T}$
- transformation function: $\tilde X = \tau(X, U)$
- prior on transforms: $P_0(U)$

Solution: transductive and iterative; solve by alternating between $P(\Theta)$ and $P(U)$:
$$P(\Theta,\gamma,U) = \frac{1}{Z(\lambda)}\, P_0(\Theta,\gamma,U)\,\exp\Bigl(\sum_t \lambda_t\bigl[\,y_t\,\mathcal{L}(\tau(X_t,U_t);\Theta) - \gamma_t\,\bigr]\Bigr)$$

Example:
$$\mathcal{L}(X;\Theta) = \log\frac{P(\tau(X,U)\,|\,\theta_+)}{P(\tau(X,U)\,|\,\theta_-)} + b$$
Optimization & Bounded QP (Extension)
MED maximizes a concave objective with convex constraints, so axis-parallel, Newton, and gradient methods will converge.

Alternatively, lower-bound the concave objective with a quadratic; we can then use SMO, QP, and other SVM optimizers.

Example (SVM feature selection): iterate the bound (making tangential contact at the current $\tilde\lambda$) and a QP, seeding each QP at the previous solution; this converges in about 10 fast iterations:
$$J(\lambda) \;\ge\; J(\tilde\lambda) + (\lambda - \tilde\lambda)^\top g \;-\; \tfrac{1}{2}\,(\lambda - \tilde\lambda)^\top H\,(\lambda - \tilde\lambda)$$
where $g$ is the gradient of $J$ at $\tilde\lambda$ and $H$ dominates the curvature of $J$.
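The following schematic is my own minimal rendering of that bound-and-QP loop, under the assumption of a diagonal curvature bound `H_diag` supplied by the caller; for a separable quadratic the box-constrained inner "QP" has the closed form shown.

```python
import numpy as np

def bound_and_qp(grad, H_diag, lam0, lo, hi, outer=10):
    """Iterate a quadratic lower bound on a concave objective J:
        J(lam) >= J(lam~) + g.(lam - lam~) - 0.5 (lam - lam~)' H (lam - lam~),
    valid whenever the (here diagonal) H dominates J's curvature.  Each pass
    maximizes the bound exactly over the box [lo, hi] and seeds the next
    bound at the new point, mirroring the slide's iterate-bound-then-QP loop."""
    lam = lam0.copy()
    for _ in range(outer):
        g = grad(lam)                          # gradient at contact point lam~
        lam = np.clip(lam + g / H_diag, lo, hi)  # closed-form boxed maximizer
    return lam
```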
MED for the Exponential Family (Extension)
Claim: MED with generative models spans members of the exponential family (where Gaussians generate SVMs).

Exponential family form:
$$P(X|\theta) = \exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$$
Conjugate prior:
$$P_0(\theta) \propto \exp\bigl(\chi^\top\theta - \alpha\, K(\theta)\bigr)$$

Analytic partition function for classification: with $\mathcal{L}(X;\Theta) = \log P(X|\theta_+) - \log P(X|\theta_-) + b$ and a non-informative prior on $b$ (which forces $\sum_t \lambda_t y_t = 0$), the partition function factorizes over $\theta_+$, $\theta_-$, and the margins. Each $\theta$ factor stays within the conjugate family:
$$Z_{\theta_\pm}(\lambda) \;\propto\; \int \exp\Bigl(\bigl(\chi \pm \sum\nolimits_t \lambda_t y_t X_t\bigr)^\top \theta \;-\; \bigl(\alpha \pm \sum\nolimits_t \lambda_t y_t\bigr)\, K(\theta)\Bigr)\, d\theta$$
which is the known normalizer of the conjugate prior evaluated at updated hyperparameters, hence analytic.
Concluding Ideas on MED and Feature Selection
Maximum Entropy Discrimination is a flexible Bayesian regularization approach. It provides a geometric view of learning as constrained minimization of divergence to prior distributions over margins, parameters, and latent variables. It simultaneously combines:
- probabilistic methods
- large-margin discrimination and SVMs
- feature selection
- classification, regression, etc.
- parameter and structure estimation
- exponential-family generative models
- transduction and detection

Feature selection is a particularly advantageous extension which provides increased sparsity (support vectors & support dimensions) and improves generalization.
Limitation of MED
MED applies to the exponential family, yet many models are MIXTURES of the e-family:
- latent models
- HMMs
- mixture models
- incomplete data
- hidden-variable Bayesian networks

Such models are intractable in discriminative & conditional settings. Thus we use variational bounds to perform the calculations: invoke EM and derive its discriminative DUAL by reversing Jensen.
Reversing Jensen's Inequality
The Dual of EM for Discriminative Latent Learning
Tony Jebara, Alex Pentland
Jensen’s Inequality
- Inequalities allow us to integrate, maximize, evaluate, and derive intractable expressions.
- Convexity: 1905-1906 by J. Jensen (Danish mathematician & engineer). See "Convex Functions, Partial Orderings and Statistical Applications" by J. Pecaric, F. Proschan and Y. Tong.

Jensen in statistics and EM:
- Subsumes many information-theoretic bounds (Cover & Thomas)
- Subsumes the EM algorithm (Dempster, Laird & Rubin; Baum-Welch)
- EM casts latent-variable problems as complete data by solving for a lower bound on likelihood

Reversals of Jensen:
- Constrained reversals and converses have been explored and are active areas in mathematics (S.S. Dragomir)
- Reversals have yet to be applied to discriminative learning
The EM Algorithm
[Figure: a function f(x) with successive tangential variational bounds at iterates x1, ..., x4]
EM makes the intractable maximization of likelihood, and the integration of Bayesian inference, tractable via variational bounds.

E-step: replace unknowns with their expected values under the current model (i.e., solve for a lower bound on the likelihood using Jensen!).

M-step: optimize the current model with the complete data (maximizing the Jensen lower bound!).

EM applies and converges for exponential-family mixtures, a very large space of models that covers most of contemporary machine learning: HMMs, Gaussian mixture models, etc.
The Exponential Family
$$P(X|\theta) = \exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$$

Mixtures of the e-family:
$$P(X|\Theta) = \sum_m P(m)\, P(X|\theta_m) = \sum_m \alpha_m \exp\bigl(A(X) + X^\top\theta_m - K(\theta_m)\bigr)$$

E-Distribution   A(X)                          K(theta)
Gaussian         -X'X/2 - (D/2) log(2 pi)      theta'theta / 2
Multinomial      0                             log(1 + sum_d exp(theta_d))
Exponential      0                             -log(-theta)

E-family: $K(\theta)$ is a convex function and $X$ lies in its gradient space. Special properties: conjugates, linearity, convexity, etc. Includes Poisson, Gaussian, trees, multinomial, exponential: TRACTABLE.

Mixtures of the e-family: Gaussian mixture models, HMMs, sigmoidal belief nets, latent Bayes nets, etc. (most machine learning models!): INTRACTABLE.
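A small numeric check (mine, not the tutorial's) that the table's A and K pairs normalize correctly: plugging each into exp(A(x) + x*theta - K(theta)) and integrating on a grid should give 1 (the discrete multinomial case is omitted).

```python
import numpy as np

families = {
    # 1-D members of exp(A(x) + x*theta - K(theta))
    "gaussian":    dict(A=lambda x: -0.5 * x * x - 0.5 * np.log(2 * np.pi),
                        K=lambda th: 0.5 * th * th,
                        theta=0.3, grid=np.linspace(-15, 15, 200001)),
    "exponential": dict(A=lambda x: np.zeros_like(x),
                        K=lambda th: -np.log(-th),      # requires theta < 0
                        theta=-0.7, grid=np.linspace(0, 200, 200001)),
}

for name, f in families.items():
    x = f["grid"]
    p = np.exp(f["A"](x) + x * f["theta"] - f["K"](f["theta"]))
    dx = x[1] - x[0]
    print(name, (p.sum() - 0.5 * (p[0] + p[-1])) * dx)  # trapezoid rule ~ 1.0
```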
Conditional & Discriminative Classification
[Figure: data actually drawn from four white Gaussians, modeled with two white Gaussians per class. The problematic maximum-likelihood fit gives l = -8.0, l_c = -1.7; the conditional maximum-likelihood solution for p(y|x) gives l = -54.7, l_c = +0.4]
Conditional & Discriminative Regression
[Figure: regression fits; maximum likelihood gives l = -4.2, l_c = -2.4, while the conditional solution gives l = -5.2, l_c = -1.8]
Discriminative Criteria and Negated Log-Sums
Maximum likelihood (clusters towards all the data):
$$l = \sum_i \log P(c_i, X_i\,|\,\Theta) = \sum_i \log \sum_m P(m, c_i, X_i\,|\,\Theta)$$

Maximum conditional likelihood (emphasizes the classification task; repels models away from incorrect data):
$$l_c = \sum_i \log P(c_i\,|\,X_i,\Theta) = \sum_i \log \sum_m P(m, c_i, X_i\,|\,\Theta) \;-\; \sum_i \log \sum_{m,c} P(m, c, X_i\,|\,\Theta)$$

Likewise, a latent log-likelihood ratio:
$$\mathcal{L}(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} = \log\sum_m P(m, X|\theta_+) \;-\; \log\sum_m P(m, X|\theta_-)$$

Maximum margin discrimination (MED) emphasizes sparsity and the discriminant boundary for the task.

The negated log-sums above make integration, maximization, etc. intractable. We need to simplify via overall lower and upper bounds...
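To make the negated log-sum concrete, here is a small sketch (assumed function names; identity-covariance Gaussians) computing l and l_c for a two-class latent mixture, matching the expressions above:

```python
import numpy as np

def log_gauss(X, mu):
    """log N(x; mu, I) for X:(N,D) against component means mu:(M,D) -> (N,M)."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return -0.5 * d2 - 0.5 * X.shape[1] * np.log(2.0 * np.pi)

def likelihoods(X, c, mus, logw):
    """l  = sum_i log sum_m p(m, c_i, X_i)          (joint: EM's objective)
       lc = l - sum_i log sum_{m,c} p(m, c, X_i)    (conditional: a NEGATED
    log-sum appears, which Jensen bounds from the wrong side).
    mus[k]: (M,D) means for class k; logw[k]: (M,) log of p(c=k, m)."""
    log_joint = np.stack(
        [np.logaddexp.reduce(log_gauss(X, mus[k]) + logw[k], axis=1)
         for k in (0, 1)], axis=1)                  # (N,2) = log p(c, X_i)
    l = log_joint[np.arange(len(c)), c].sum()
    lc = l - np.logaddexp(log_joint[:, 0], log_joint[:, 1]).sum()
    return l, lc
```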
Jensen:
- Uses the concavity of log()
- Reweights data with responsibilities
- The variational lower bound at the current model $\tilde\Theta$ makes tangential contact with the true objective (the log-sum)
- Local computations yield a global lower bound: $f(E\{h\}) \ge E\{f(h)\}$

DANGER: on a NEGATED log-sum this gives an upper bound instead!
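A three-line numeric check (mine) of the contact property: the Jensen bound on a log-sum equals the log-sum exactly when the responsibilities q are the normalized mixture weights, and dips below for any other q.

```python
import numpy as np

p = np.array([0.05, 0.30, 0.10])     # unnormalized component likelihoods
q = p / p.sum()                      # responsibilities at the current model
print(np.log(p.sum()))               # true log-sum:              ~ -0.799
print(np.sum(q * np.log(p / q)))     # Jensen bound at optimal q: identical
u = np.full(3, 1 / 3)
print(np.sum(u * np.log(p / u)))     # any other q: strictly smaller
```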
Reverse-Jensen:
- Uses the convexity of the e-family
- Reweights AND translates the data
- The variational upper bound at the current model $\tilde\Theta$ makes tangential contact with the true objective (the log-sum), or a tighter variant
- Local computations yield a global upper bound: $f(E\{h\}) \le E\{f(h)\}$
Reverse-Jensen Bounds
[Figure: Gaussian and multinomial cases; log-sum (gray), Jensen bound (black), reverse-Jensen bound (white). Short proof idea: map bowl to bowl]
CEM Regression Results
Estimate a p(y|x) regression model with 2 Gaussians (Gaussian gates with linear experts). CEM monotonically increases conditional likelihood, unlike EM. Result: a better p(y|x) which captures the multimodality in y without wasting resources in x.

Algorithm                        Regression Accuracy
Cascade-Correlation (0 hidden)   24.86%
Cascade-Correlation (5 hidden)   26.25%
C4.5                             21.5%
Linear Discriminant               0.0%
k=5 Nearest Neighbor              3.57%
EM, 2 Gaussians                  22.32%
EM & CEM, 1 Gaussian             20.79%
CEM, 2 Gaussians                 26.63%

[Plots: the CEM fit and its conditional log-likelihood rising monotonically over iterations]
CEM Regression Results
Abalone age data set (UCI).

[Plots: the EM fit on the same data and its non-monotonic conditional log-likelihood over iterations]
CEM Classification Results
The Gaussian mixture model shown earlier for classification: monotonic convergence, at roughly double the per-epoch computation of EM. CEM accuracy = 93%, EM accuracy = 59%.

Multinomial mixture model: 3-class multinomials for 60 base-pair protein chains. CEM monotonically increases conditional likelihood.
[Diagram: joint model Theta -> (X, Y) vs. conditional model Theta' with X -> Y]

- Each assumes different independencies in the data = data structure (like model structure, which is useful for learning)
- Exponential number of conditional models:
  -> use a handful for frequent tasks and the joint for rare tasks
  -> use the marginal for unreliable covariates
Bayesian Inference: Conditional vs. Joint
Notation: x = covariate, y = response, X = covariate dataset, Y = response dataset.

Joint approach:
$$p(x,y) = \int p(x,y,\Theta\,|\,X,Y)\, d\Theta = \int p(x,y\,|\,\Theta)\, p(\Theta\,|\,X,Y)\, d\Theta$$
$$p(x,y) \approx p(x,y\,|\,\Theta^*), \qquad \Theta^* = \begin{cases}\arg\max_\Theta\, p(X,Y|\Theta)\,p(\Theta) & \text{(MAP)}\\ \arg\max_\Theta\, p(X,Y|\Theta) & \text{(ML)}\end{cases}$$

Conditioned joint approach:
$$p(y|x) = \frac{\int p(x,y\,|\,\Theta)\,p(\Theta\,|\,X,Y)\, d\Theta}{\int p(x\,|\,\Theta)\,p(\Theta\,|\,X,Y)\, d\Theta}$$

Conditional approach (a model $\Theta'$ of $p(y|x)$ only):
$$p(y|x) = \int p(y\,|\,x,\Theta')\,p(\Theta'\,|\,X,Y)\, d\Theta'$$
$$p(\Theta'\,|\,X,Y) = \frac{p(Y|X,\Theta')\,p(X|\Theta')\,p(\Theta')}{p(X,Y)} = \frac{p(Y|X,\Theta')\,p(X)\,p(\Theta')}{p(X,Y)}$$
(the conditional model places no distribution on X, so $p(X|\Theta') = p(X)$), giving
$$p(y|x) \propto \int p(y\,|\,x,\Theta')\,p(Y\,|\,X,\Theta')\,p(\Theta')\, d\Theta'$$
$$p(y|x) \approx p(y\,|\,x,\Theta'^*), \qquad \Theta'^* = \begin{cases}\arg\max_{\Theta'}\, p(Y|X,\Theta')\,p(\Theta') & \text{(MAP)}\\ \arg\max_{\Theta'}\, p(Y|X,\Theta') & \text{(ML)}\end{cases}$$

Of course, full Bayesian integration is impractical, so we must consider conditional MAP and ML...
An Exact Bayesian Example

Data: p(y|x) from 4 points, to fit with 2 Gaussians.

[Figure: the conditional density estimate vs. the conditioned joint density estimate over the (X, Y) plane]
Extensions
- Annealing: the bound can accommodate a Gibbs temperature for a global optimum, e.g. raising the e-family term $\exp\bigl(A(X) + X^\top\theta - K(\theta)\bigr)$ to a power $\beta$
- Latent Bayesian networks
- Hidden Markov models
- MED: large-margin latent discrimination
- Variational Bayesian inference
- Generic optimization

Check http://www.media.mit.edu/~jebara
Action-Reaction Learning
Automatic Visual Analysis and Synthesis of Interactive Behaviour
Tony Jebara, Alex Pentland
Motivation & Background

IDEA: computer vision, Face & Gesture at Killington, behaviourists...

BEHAVIOUR PERCEPTION: static imagery -> simple temporal models -> learned temporal dynamics, higher-order control, multiple hypotheses, HMMs, NNs (Blake, Bregler, Pentland, Bobick, Hogg)

BEHAVIOUR SYNTHESIS: competing behaviours, control, reinforcement, ethology, cog-sci (Brooks, Terzopolous, Blumberg, Uchibe, Mataric, Large)

ARL (ACTION-REACTION LEARNING):
- Machine learning of the correlations between stimulus & response via perceptual measurements of human interactions
- Imitation-based learning (Mataric)
- Behaviourism (Thorndike, Watson, Skinnerian, Gibsonian) -> reactionary
- Watch humans interacting to learn how to react to a stimulus
Scenario
- Automatic unsupervised observation of a 2-agent interaction
- Track lip motions
- Discover correlations between a past action & the consequent reaction
- Estimate p(y|x)
System Architecture
OFFLINE: learning from human interaction, spying on two users to learn p(y|x).
ONLINE: interaction with a single user via the learned p(y|x).
Perception: probabilistic head & hand tracking via Expectation Maximization (EM).

Responsibilities for color (RGB) pixels and spatial (xy) blob membership:
$$p(i\,|\,\mathbf{x}_{rgb}) \propto \frac{p_i\, \exp\bigl(-\tfrac{1}{2}(\mathbf{x}_{rgb}-\mu_i)^\top \Sigma_i^{-1} (\mathbf{x}_{rgb}-\mu_i)\bigr)}{(2\pi)^{3/2}\,|\Sigma_i|^{1/2}}, \qquad p(j\,|\,\mathbf{x}_{xy}) \propto \frac{p_j\, \exp\bigl(-\tfrac{1}{2}(\mathbf{x}_{xy}-\mu_j)^\top \Sigma_j^{-1} (\mathbf{x}_{xy}-\mu_j)\bigr)}{2\pi\,|\Sigma_j|^{1/2}}$$
each normalized by the corresponding sum over components.
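A compact numpy version (assumed names) of the responsibility computation above; the same routine serves both the RGB color Gaussians and the spatial xy blobs.

```python
import numpy as np

def responsibilities(x, weights, means, covs):
    """E-step: p(i | x) = pi_i N(x; mu_i, S_i) / sum_j pi_j N(x; mu_j, S_j),
    computed in the log domain for numerical stability."""
    logp = np.array([
        np.log(w) - 0.5 * (x - m) @ np.linalg.solve(S, x - m)
        - 0.5 * np.linalg.slogdet(2.0 * np.pi * S)[1]
        for w, m, S in zip(weights, means, covs)])
    r = np.exp(logp - logp.max())
    return r / r.sum()
```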
Perception: temporal representation. 5 parameters per blob = 2 centroid + 3 square-root covariance: $(\mu_x,\ \mu_y,\ \Sigma_{xx}^{1/2},\ \Sigma_{xy}^{1/2},\ \Sigma_{yy}^{1/2})$.

Graphical output (seen by both users).
Temporal Modeling: time series prediction (Gershenfeld, Weigend, Mozer, Wan). Santa Fe competition: NNs, RNNs, HMMs, differential equations, etc., on sunspot, Bach, physiological, and chaotic-laser series.

Short-term memory pre-processing (Wan, Elliot-Anderson):
$$y(t) \approx g\bigl(y(t-1),\ y(t-2),\ \dots,\ y(t-T)\bigr)$$
$$x(t) = \bigl[\, y(t-1)\ \ y(t-2)\ \cdots\ y(t-T) \,\bigr]$$

y: features (blobs concatenated); g: prediction mapping; short-term memory with T = 120.
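A sketch (assumed helper name) of the short-term-memory pre-processing: stack the previous T feature frames into one input vector per time step.

```python
import numpy as np

def short_term_memory(Y, T=120):
    """Y: (N, D) time series of concatenated blob features.
    Returns inputs x(t) = [y(t-1) ... y(t-T)] flattened, and targets y(t)."""
    N = len(Y)
    X = np.stack([Y[t - T:t][::-1].ravel() for t in range(T, N)])
    return X, Y[T:]
```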
Temporal Modeling: principal components analysis (linear; alternatives include FFT, wavelets, ...).

Gaussian distribution of the STM (roughly 6.5 seconds). Dims = T x Feats = 120 x 30 -> 40 (95% of the energy of the submanifold): a low-dimensional (i.e., smoothed) characterization of the past interaction.

[Figure: eigenvalues, eigenvectors, eigenspace]

Learn the mapping probabilistically: p(y|x) = p(future | STM), versus a deterministic y = g(x).
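A minimal PCA compression of the STM vectors (my sketch; the 95% energy threshold follows the slide):

```python
import numpy as np

def pca_compress(X, energy=0.95):
    """Keep the leading principal components capturing `energy` of the
    variance (120 x 30 = 3600 dims -> ~40 on the slide's data)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), energy)) + 1
    return Xc @ Vt[:k].T, Vt[:k]       # low-dim coefficients and basis
```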
Learning & The CEM Algorithm

EXPECTATION MAXIMIZATION (EM): learns p(x,y) by maximizing $\prod_{i=1}^{N} p(x_i, y_i\,|\,\Theta)$, a joint model of the phenomenon. Powerful convergence and a clean statistical framework; more global than gradient solutions; can be deterministically annealed.

For conditional problems (input/output regression, classification), joint models are outperformed (e.g., NNs and RNNs versus HMMs) since they don't optimize output error (i.e., the testing criterion is unlike the training criterion).

CONDITIONAL EXPECTATION MAXIMIZATION (CEM), NIPS 11, 1998: learns p(y|x) by maximizing $\prod_{i=1}^{N} p(y_i\,|\,x_i,\Theta)$, a conditional model of the task, with convergence properties like EM but for conditional likelihood.
Variational bound maximization: monotonic, skips local minima, deterministically annealable.

Model: a mixture of experts (soft piece-wise linear): Gaussian gates with conditioned Gaussian regressors. CEM applies to other models too: HMMs, multinomial mixtures, etc.
$$p(y\,|\,x,\Theta) = \frac{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\mu_m,\Sigma_m)\, \mathcal{N}\bigl(y;\hat y_m(x),\Omega_m\bigr)}{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\mu_m,\Sigma_m)}$$

Output: expectation or arg max (regression):
$$\hat y = \int y\, p(y\,|\,\hat x)\, dy = \sum_m \frac{\alpha_m\,\mathcal{N}(\hat x;\mu_m,\Sigma_m)}{\sum_{m'} \alpha_{m'}\,\mathcal{N}(\hat x;\mu_{m'},\Sigma_{m'})}\; \hat y_m(\hat x), \qquad \hat y_m(\hat x) = \mu_m^y + \Sigma_m^{yx}\bigl(\Sigma_m^{xx}\bigr)^{-1}\bigl(\hat x - \mu_m^x\bigr)$$

Computationally very efficient: simple matrix operations.

[Figure: bound maximization on f(x); EM vs. CEM]
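Putting the expectation output above into code (assumed names): gate each expert by its Gaussian responsibility in x, then blend the conditioned-Gaussian means.

```python
import numpy as np

def predict(x, alphas, mu_x, mu_y, S_xx, S_yx):
    """yhat(x) = sum_m gate_m(x) * [mu_y_m + S_yx_m S_xx_m^{-1} (x - mu_x_m)],
    with gate_m(x) proportional to alpha_m N(x; mu_x_m, S_xx_m)."""
    logg = np.array([
        np.log(a) - 0.5 * (x - m) @ np.linalg.solve(S, x - m)
        - 0.5 * np.linalg.slogdet(2.0 * np.pi * S)[1]
        for a, m, S in zip(alphas, mu_x, S_xx)])
    g = np.exp(logg - np.logaddexp.reduce(logg))     # normalized gates
    ys = [my + R @ np.linalg.solve(S, x - mx)
          for mx, my, S, R in zip(mu_x, mu_y, S_xx, S_yx)]
    return sum(gi * yi for gi, yi in zip(g, ys))
```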
[Diagrams: time-series graphical models linking x(t), y(t), and y(t-1) for agents A and B]
TRAINING MODE: the system accumulates action/reaction pairs (x, y) and uses CEM to optimize a conditioned Gaussian mixture model for p(y|x).

INTERACTIVE MODE: the system completes the missing time series and synthesizes the reaction in graphical form for the user, using the learned p(y|x).

Integration, continued:

FEEDBACK MODE: the predicted measurement on the user can be fed back, as a non-linear learned EKF, to aid vision. Can also use p(x) to filter vision and find correspondence.

SIMULATION MODE: fully synthesize both components of the time series (user A and user B). Some instabilities/bugs (no grounding) -> future work.
[Diagrams: feedback and simulation configurations of the A/B time-series model]
Training & Results

Training data: 5 minutes at 15 Hz, approx. 5 repetitions of each gesture. Model: T = 120, dims = 22, M = 25 (to maintain real time). Convergence: 2 hours.

Prediction error:
- Nearest Neighbor: 1.57% RMS
- Constant Velocity: 0.85% RMS
- ARL: 0.64% RMS
Results: [video clips of the SCARE, WAVE, and CLAP interactions]
Alternate Perceptual Modalities
[Plot: rotation quaternion (q0, q1, q2, q3) tracked over time]

FACE MODELING:
- 3D translation, 3D pose, focal length: 7 DOFs
- 3D eigenmodel deformations, texture: 40 DOFs
- Real-time (... + speech + ...)
Conclusions
1 - Unsupervised Discovery of Simple Interactive Behaviour by Perceptual Observations and Statistical Time Series Prediction of Future Given Past or Reaction Given Action
2 - Imitation Based Learning of Behaviour
3 - Real-Time Behaviour Acquisition and Interactive Synthesis
4 - Small Amounts of Training Data and Non-Intrusive Training
5 - Non-Linear Predictive Model for Feedback Perception
6 - Monotonically Convergent Maximum Conditional Likelihood i.e. Discriminant Probabilistic Learning (CEM)
7 - No A Priori Segmentation or Classification of Gestures
Wearable Platform: Dynamic Personal Enhanced Reality System
Tony Jebara, Bernt Schiele, Nuria Oliver, Alex Pentland
DyPERS Architecture
- 3-button interface: record, associate, and discard
- The user records live A/V clips with the wearable
- Associates them with a visual trigger object(s)
- The audio-video is replayed when computer vision sees the trigger object

DyPERS Visual Recognition
- Multidimensional filter-response histograms from differences of Gaussian linear convolutions (magnitude of the 1st derivative and the Laplace operator)
- Compute the probability of an object from k i.i.d. measurements
- Kalman filter on the object probabilities to smooth the classifier
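A rough sketch of one plausible filter-response histogram (using scipy.ndimage for the Gaussian-derivative convolutions; the sigma and bin count are assumptions, not DyPERS's actual settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def filter_histogram(img, sigma=2.0, bins=16):
    """Joint histogram of two filter responses of a grayscale image:
    gradient magnitude (from Gaussian first derivatives) and the Laplacian."""
    gx = gaussian_filter(img, sigma, order=(0, 1))
    gy = gaussian_filter(img, sigma, order=(1, 0))
    mag = np.hypot(gx, gy)
    lap = gaussian_laplace(img, sigma)
    h, _, _ = np.histogram2d(mag.ravel(), lap.ravel(), bins=bins, density=True)
    return h
```

Per-object probabilities from the k i.i.d. measurements would then multiply (sum in the log domain) before the Kalman smoothing step.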
Video
Wearable Platform: Interactive Behavior Acquisition
Tony Jebara, Alex Pentland

Hardware:
- Sony Picturebook laptop
- 2 cameras
- 2 microphones

Action-Reaction Learning:
- Audio-visual processing
- Time series prediction
- Discriminative learning
- DyPERS associative memory

Wearable long-term behavior acquisition.
Tracking Conversational Context (Audio)
Tony Jebara, Yuri Ivanov, Ali Rahimi, Alex Pentland

System Architecture: tracking conversational context for machine mediation; speech recognition on multiple speakers with real-time topic-spotting.
- Bag-of-words (multinomials)
- Short-term memory with decay
- Probability of topic
- A coarse descriptor of mood, topic, etc., used to select prompts to stimulate conversation (see the sketch below)
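A toy rendering (all names assumed) of topic tracking with a decayed bag-of-words: accumulate per-topic multinomial log-likelihoods of incoming words with exponential forgetting, then normalize.

```python
import numpy as np

def topic_posterior(word_ids, log_topic_word, decay=0.98):
    """log_topic_word: (n_topics, vocab) log multinomial parameters (assumed).
    Older words are down-weighted by `decay` per step (short-term memory)."""
    score = np.zeros(log_topic_word.shape[0])
    for w in word_ids:                   # words in temporal order
        score = decay * score + log_topic_word[:, w]
    score -= score.max()                 # stabilize before exponentiating
    p = np.exp(score)
    return p / p.sum()
```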
Conversation Trace
Clustering & MED: a transductive approach for the few-labels case.
Text data from 12 Newsgroups
Wearable Platform
Real-time probabilities of topics (politics, health, religion, etc.). Could also detect emotions & situations...
References
1) Tony Jebara and Alex Pentland. On Reversing Jensen's Inequality. In Neural Information Processing Systems 13 (NIPS '00), Dec. 2000.
2) Tony Jebara, Yuri Ivanov, Ali Rahimi and Alex Pentland. Tracking Conversational Context for Machine Mediation of Human Discourse. In AAAI Fall 2000 Symposium - Socially Intelligent Agents - The Human in the Loop, Nov. 2000.
3) Tony Jebara and Tommi Jaakkola. Feature Selection and Dualities in Maximum Entropy Discrimination. In 16th Conf. on Uncertainty in Artificial Intelligence (UAI 2000).
4) Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum Entropy Discrimination. In Neural Information Processing Systems 12 (NIPS '99).
5) Tony Jebara and Alex Pentland. Action Reaction Learning: Automatic Visual Analysis and Synthesis of Interactive Behaviour. In 1st Intl. Conf. on Computer Vision Systems (ICVS '99).
6) Bernt Schiele, Nuria Oliver, Tony Jebara and Alex Pentland. An Interactive Computer Vision System, DyPERS: Dynamic Personal Enhanced Reality System. In 1st Intl. Conf. on Computer Vision Systems (ICVS '99).
7) Tony Jebara and Alex Pentland. Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm. In Neural Information Processing Systems 11 (NIPS '98).
8) Tony Jebara. Action-Reaction Learning: Analysis and Synthesis of Human Behaviour. Master's Thesis, MIT, May 1998.