Local Descriptors for Spatio-Temporal Recognition

Ivan Laptev and Tony Lindeberg
Computational Vision and Active Perception Laboratory (CVAP)
Dept. of Numerical Analysis and Computer Science
KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden
Motivation
Area: Interpretation of non-rigid motion
Non-rigid motion results in visual events such as: occlusions and disocclusions; appearance and disappearance; unifications and splits; velocity discontinuities.
Events are often characterized by non-constant motion and complex spatio-temporal appearance.
Events provide a compact way to capture important aspects of spatio-temporal structure.
Local Motion Events
Idea: look for spatio-temporal neighborhoods that maximize the local variation of image values over space and time
Interest points
Spatial domain (Harris and Stephens, 1988): select maxima over (x, y) of

    H = det(μ) - k · trace^2(μ),

where μ is the second-moment matrix of the spatial gradient (Lx, Ly)^T, averaged over a local window.

Analogy in space-time: select space-time maxima over (x, y, t) of

    H = det(μ) - k · trace^3(μ),

where μ is now the 3x3 second-moment matrix of the spatio-temporal gradient (Lx, Ly, Lt)^T. This selects points with high variation of image values over space and time. (Laptev and Lindeberg, ICCV’03)
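As an illustration, the space-time Harris response can be sketched in a few lines. This is a simplified sketch, not the authors' implementation: the parameter k = 0.005, the integration-to-differentiation scale ratio s, and the use of SciPy's `gaussian_filter` are all assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spatiotemporal_harris(volume, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Sketch of a space-time Harris response over a video volume.

    volume     : 3-D array indexed (t, y, x) of image intensities.
    sigma, tau : spatial / temporal differentiation scales (assumed values).
    s          : ratio of integration to differentiation scale (assumption).
    k          : sensitivity parameter (assumed value).
    """
    L = gaussian_filter(volume.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)  # derivatives along t, y, x

    # Entries of the 3x3 second-moment matrix, averaged at the integration scale.
    smooth = lambda a: gaussian_filter(a, sigma=(s * tau, s * sigma, s * sigma))
    mxx, myy, mtt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    mxy, mxt, myt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)

    # det(mu) for the symmetric 3x3 matrix [[mxx,mxy,mxt],[mxy,myy,myt],[mxt,myt,mtt]]
    det = (mxx * (myy * mtt - myt**2)
           - mxy * (mxy * mtt - myt * mxt)
           + mxt * (mxy * myt - myy * mxt))
    trace = mxx + myy + mtt
    return det - k * trace**3  # interest points are local positive maxima of this
```

Interest points would then be extracted as local maxima of the returned response volume.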
Synthetic examples
Velocity discontinuity (spatio-temporal ”corner”)
Unification and split
Image transformations
Under common image transformations, a point p maps to p′:
• Spatial scale: (x′, y′) = (s·x, s·y)
• Temporal scale: t′ = τ·t
• Galilean transformation (constant relative motion): x′ = x + vx·t, y′ = y + vy·t
Estimate locally to obtain invariance to these transformations (Laptev and Lindeberg ICCV’03, ICPR’04)
Invariance with respect to size changes
Feature detection: selection of spatial scale
(Comparison: stationary camera vs. stabilized camera)
Feature detection: velocity adaptation
Selection of temporal scales captures the temporal extent of events
Feature detection: selection of temporal scale
Features from human actions
Why local features in space-time?
Make a sparse and informative representation of complex motion patterns;
Obtain robustness w.r.t. missing data (occlusions) and outliers (complex, dynamic backgrounds, multiple motions);
Match similar events in image sequences;
Recognize image patterns of non-rigid motion.
Do not rely on tracking or spatial segmentation prior to motion recognition.
Space-time neighborhoods
boxing
walking
hand waving
Local space-time descriptors
Describe image structures in the neighborhoods of detected features, defined by positions pi and covariance matrices Σi.

A well-founded choice of local descriptor is the local jet (Koenderink and van Doorn, 1987), computed from spatio-temporal Gaussian derivatives (here evaluated at the interest points pi).
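A local jet of this kind can be sketched as follows. The sketch assumes derivatives up to order 2 (giving 9 components in three variables) and fixed scales sigma and tau; the original work uses locally adapted scales and possibly higher orders.

```python
import numpy as np
from itertools import product
from scipy.ndimage import gaussian_filter

def local_jet(volume, p, sigma=2.0, tau=2.0, order=2):
    """Sketch: spatio-temporal local jet at a point p = (t, y, x).

    Collects all Gaussian derivatives of total order 1..`order`
    along the (t, y, x) axes, evaluated at p.
    """
    vol = volume.astype(float)
    jet = []
    # enumerate differentiation orders (ot, oy, ox) with total order in [1, order]
    for o in product(range(order + 1), repeat=3):
        if 1 <= sum(o) <= order:
            d = gaussian_filter(vol, sigma=(tau, sigma, sigma), order=o)
            jet.append(d[p])
    return np.array(jet)
```

For order 2 in three variables there are 3 first-order and 6 second-order derivatives, so the descriptor has 9 components.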
Use of descriptors: Clustering
Clustering: group similar points in the space of image descriptors using K-means clustering, and select significant clusters for classification.
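The K-means step could be sketched as below. This is a minimal illustrative implementation, not the one used in the work; the farthest-point initialization is a robustness choice of this sketch, not something stated on the slide.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means for grouping descriptor vectors (rows of X)."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization: avoids duplicate starting centers
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers, dtype=float)

    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Significant clusters could then be selected, e.g., by their size or compactness (the selection criterion is not specified on the slide).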
Use of descriptors: Matching
Find similar events in pairs of video sequences
Are other descriptors better?
Consider the following choices for describing a spatio-temporal neighborhood:
• Multi-scale spatio-temporal derivatives
• Projections to orthogonal bases obtained with PCA
• Histogram-based descriptors

Multi-scale derivative filters: derivatives up to order 2 or 4, at 3 spatial and 3 temporal scales; since there are 9 spatio-temporal derivatives up to order 2 and 34 up to order 4, this gives 9 x 3 x 3 = 81 or 34 x 3 x 3 = 306 dimensional descriptors.
PCA descriptors:
• Compute normal flow or optic flow in locally adapted spatio-temporal neighborhoods of features
• Subsample the flow fields to a resolution of 9x9x9 pixels
• Learn PCA basis vectors (separately for each type of flow) from features in training sequences
• Project the flow fields of new features onto the 100 most significant eigen-flow-vectors
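The PCA step above can be sketched with a plain SVD. This is a generic sketch: the function names and the use of NumPy's SVD are assumptions, and only the basis-learning and projection steps are shown (flow computation and subsampling are omitted).

```python
import numpy as np

def pca_basis(flows, n_components=100):
    """Learn a PCA basis from training flow fields.

    flows : array whose first axis indexes training features; each flow
            field (e.g. a subsampled 9x9x9x2 optic-flow volume) is flattened.
    Returns the mean vector and the top `n_components` eigen-flow-vectors
    (rows of Vt from the SVD of the centered data).
    """
    X = flows.reshape(len(flows), -1).astype(float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(flow, mean, basis):
    """Descriptor = coefficients of a new flow field in the learned basis."""
    return basis @ (flow.ravel() - mean)
```

The descriptor of a new feature is then the coefficient vector returned by `project`.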
Position-dependent histograms:
• Divide the neighborhood of each point pi into M^3 subneighborhoods, here M = 1, 2, 3
• Compute space-time gradients (Lx, Ly, Lt)^T or optic flow (vx, vy)^T at combinations of 3 temporal and 3 spatial scales, where the scales are locally adapted detection scales
• Compute separable histograms over all subneighborhoods, derivatives/velocities and scales
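A position-dependent histogram descriptor of this kind could look as follows for a single scale. This is a sketch under assumptions: one scale only, a fixed histogram range, and per-cell normalization; the slide does not specify these details.

```python
import numpy as np

def position_dependent_histogram(grads, M=2, bins=8, lo=-1.0, hi=1.0):
    """Sketch: separable histograms of (Lx, Ly, Lt) over M^3 subneighborhoods.

    grads : array (T, Y, X, 3) of space-time gradients in one neighborhood.
    Returns the concatenation of one histogram per cell and per gradient
    component, i.e. M**3 * 3 * bins values.
    """
    T, Y, X, _ = grads.shape
    hists = []
    for a in range(M):
        for b in range(M):
            for c in range(M):
                cell = grads[a*T//M:(a+1)*T//M,
                             b*Y//M:(b+1)*Y//M,
                             c*X//M:(c+1)*X//M]
                # "separable": one 1-D histogram per gradient component
                for comp in range(3):
                    h, _ = np.histogram(cell[..., comp], bins=bins,
                                        range=(lo, hi))
                    hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)
```

The full descriptor would concatenate such vectors over the 3 spatial and 3 temporal scales.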
Evaluation: Action Recognition
Database: six action classes (walking, running, jogging, handwaving, handclapping, boxing)
Initially, recognition with a Nearest Neighbor Classifier (NNC):
• Take the sequences of X subjects for training (Strain)
• For each test sequence stest, find the closest training sequence strain,i by minimizing the distance between their descriptors
• The action of stest is regarded as recognized if class(stest) = class(strain,i)
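The NNC decision rule is simple enough to state in code. Note this sketch compares single descriptor vectors with a Euclidean distance; the slide's sequence-level distance is not specified, so this is a stand-in.

```python
import numpy as np

def nearest_neighbor_classify(test_desc, train_descs, train_labels):
    """Return the label of the training descriptor closest to test_desc.

    test_desc    : 1-D descriptor of the test sequence.
    train_descs  : 2-D array, one training descriptor per row.
    train_labels : class label per training row.
    """
    d = np.linalg.norm(train_descs - test_desc, axis=1)
    return train_labels[int(d.argmin())]
```

Recognition succeeds when the returned label matches the true class of the test sequence.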
Results: Recognition rates (all)
(Charts: scale-adapted features vs. scale- and velocity-adapted features)
Results: Recognition rates (Hist)
(Charts: scale-adapted features vs. scale- and velocity-adapted features)
Results: Recognition rates (Jets)
(Charts: scale-adapted features vs. scale- and velocity-adapted features)
Results: Comparison
Global-STG-HIST: Zelnik-Manor and Irani CVPR’01
Spatial-4Jets: Spatial interest points (Harris and Stephens, 1988)
Confusion matrices
Position-dependent histograms for space-time interest points
Local jets at spatial interest points
(Matrices: STG-PCA (ED) and STG-PD2HIST (ED), where ED denotes Euclidean distance)
Related work
• Mikolajczyk and Schmid, ECCV’02, CVPR’03
• Lowe, ICCV’99
• Zelnik-Manor and Irani, CVPR’01
• Fablet, Bouthemy and Pérez, PAMI’02
• Laptev and Lindeberg, ICCV’03, IVC 2004, ICPR’04
• Efros et al., ICCV’03
• Harris and Stephens, Alvey’88
• Koenderink and van Doorn, PAMI 1992
• Lindeberg, IJCV 1998
Summary
Descriptors of local spatio-temporal features enable classification and matching of motion events in video
Position-dependent histograms of space-time gradients and optical flow give high recognition performance. Results consistent with findings for SIFT descriptor (Lowe, 1999) in the spatial domain.
Future work:
• Include spatial and temporal consistency of local features
• Multiple actions in the scene
• Information in between events
Results: Recognition Rates

(Charts for each action class: walking, running, jogging, handwaving, handclapping, boxing; comparing scalar-product distance and Euclidean distance)
Walking model
Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle.
Define the state X of the model at time t0 by the position, size, phase and velocity of the person:
Associate each phase with a silhouette of a person extracted from the original sequence
Sequence alignment: given a data sequence with current moment t0, detect and classify interest points in a time window of length tw: (t0 − tw, t0).

Transform the model features according to X, and for each model feature fm,i = (xm,i, ym,i, tm,i, σm,i, τm,i, cm,i) compute its distance di to the closest data feature fd,j of the same class, cd,j = cm,i.

Define the ”fit function” D of model configuration X as the sum of the distances of all features, weighted with respect to their ”age” (t0 − tm) so that recent features have more influence on the matching.
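The fit function D could be sketched as below. The exponentially decaying weight is an assumption of this sketch; the slide only says that recent features should have more influence.

```python
import numpy as np

def fit_function(distances, t_feat, t0, lam=1.0):
    """Sketch of the fit function D for a model configuration.

    distances : per-feature distances di to the closest same-class data feature.
    t_feat    : temporal positions tm of the matched model features.
    t0        : current moment.
    lam       : decay rate of the age weighting (assumed exponential form).
    """
    age = t0 - np.asarray(t_feat, dtype=float)
    w = np.exp(-lam * age)  # recent features (small age) get weight near 1
    return float(np.sum(w * np.asarray(distances, dtype=float)))
```

Minimizing D over the state X then aligns the walking model to the data sequence.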
(Figure: alignment of model features to data features)
At each moment t0, minimize D with respect to X using the standard Gauss-Newton method.
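A generic Gauss-Newton iteration of the kind referred to here can be sketched as follows; the residual and Jacobian functions for the actual model state X are not specified on the slide, so this is a general-purpose sketch.

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, iters=10):
    """Generic Gauss-Newton loop minimizing sum(residuals(x)**2).

    residuals : function x -> vector of residuals r(x).
    jacobian  : function x -> Jacobian matrix J(x) of the residuals.
    x0        : initial parameter estimate.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residuals(x)
        J = jacobian(x)
        # Gauss-Newton step: minimize ||J dx + r||^2, i.e. solve
        # the normal equations (J^T J) dx = -J^T r via least squares
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        x = x + dx
    return x
```

For the tracking application, x would hold the model state (position, size, phase, velocity) and the residuals would be the age-weighted feature distances.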
Experiments