What is an event?
“a thing that happens or takes place” (Oxford Dictionary)
Examples:
Human gestures
Human actions (running, drinking, etc.)
Human interaction (gunfight, car-crash, etc.)
Sports event (tennis, soccer, etc.)
Nature event (fire, storm, etc.)
…
Content-based image and video retrieval 1
Why event recognition?
Huge amounts of video available
Event recognition useful for:
Content-based browsing
“Fast forward to the next goal scoring scene”
Video search
“Show me all videos with Bush and Putin shaking hands”
Human actions
Human actions are major events in movie content
Meaning hidden within visual representation
What are human actions?
Definition 1:
Physical body motion:
KTH action Database
(http://www.nada.kth.se/cvap/actions/)
What are human actions?
Definition 2:
Interaction with the environment for a specific purpose
Same physical motion – different action depending on the context
Challenges
Variations:
Lighting, appearance, view, background
Individual motion, camera motion
Difference in shape
Difference in motion
Example: “Drinking” and “Smoking” are similar in overall shape (human posture) and in motion (hand motion)
Challenges
Problems:
Existing datasets contain few action classes, captured in controlled and simplified settings
Lots of (realistic) training data needed
Idea:
Realistic human actions are frequent events within movies
Perform automatic labeling of video sequences based on movie scripts
Action event dataset
“Coffee and Cigarettes” dataset
159 annotated “Drinking” samples
149 annotated “Smoking” samples
Annotation: temporal (first frame, keyframe, last frame) and spatial (head rectangle, torso rectangle)
http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
Challenges
Hollywood Human Actions (HOHA) dataset
Script-based annotation
Problems:
No time information available
Use subtitles to align scripts with the video
Described actions do not always correspond to movie scene
Assign an alignment score to scene description:
a = (#matched words)/(#all words)
Apply a threshold (e.g., a > 0.5)
Variability of action expressions in text
Use a regularized perceptron to classify action descriptions
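The alignment-score rule above can be sketched directly. The tokenization (plain word lists, case-insensitive matching) is an assumption for illustration, not taken from the paper:

```python
def alignment_score(scene_words, subtitle_words):
    """a = (#matched words) / (#all words) for one scene description.
    Case-insensitive word matching is an assumed detail."""
    subs = {w.lower() for w in subtitle_words}
    matched = sum(1 for w in scene_words if w.lower() in subs)
    return matched / len(scene_words) if scene_words else 0.0

def keep_annotation(scene_words, subtitle_words, threshold=0.5):
    # Keep the scene description only if its score clears the threshold.
    return alignment_score(scene_words, subtitle_words) > threshold
```
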
Script alignment
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
Transferred time interval for this scene: 01:20:17 to 01:20:23
Scripts are publicly available for many movies
Subtitles are available for most movies
Transfer times from subtitles to scripts by text alignment
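The time transfer relies on parsing the subtitle timestamps. A minimal sketch using only Python's standard library (the helper names are illustrative):

```python
import re
from datetime import timedelta

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_srt_time(stamp):
    """Parse an SRT timestamp such as '01:20:17,240' into a timedelta."""
    h, m, s, ms = map(int, SRT_TIME.match(stamp).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

def subtitle_interval(timing_line):
    """Extract (start, end) from a subtitle timing line,
    e.g. '01:20:17,240 --> 01:20:20,437'."""
    start, end = timing_line.split(" --> ")
    return parse_srt_time(start), parse_srt_time(end)
```
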
Script alignment: Problems
Perfect alignment does not guarantee perfect action annotation in video
Of 147 actions with correct alignment (a = 1), only 70% matched the video
Errors
Temporal misalignment (10%)
Outside the field of view (10%)
Completely missing in video (10%)
(a: quality of subtitle-script matching)
Example of a false positive for “get out of car” (the action is not visible in the video): “A black car pulls up, two army officers get out.”
Script-based Annotation
Pros:
Realistic variation of actions
Many examples per class, many classes
No extra overhead for new classes
Character names may be used to resolve “who is doing what?”
Problems:
No spatial localization
Temporal localization may be poor
Script does not always follow the movie
Automatic annotation useful for training, but not precise enough for evaluation
Retrieving actions in movies
Ivan Laptev and Patrick Perez
ICCV 2007
Actions == space-time objects?
Figure: “atomic” actions (car exit, phoning, smoking, hand shaking, drinking) viewed as “stable-view” objects over time; the idea is to take advantage of their space-time shape
Actions == space-time objects?
Can actions be considered as space-time objects?
Transfer object detectors to action recognition
Here: only atomic actions considered (i.e. simple actions)
Temporal slices of the same action under different circumstances look similar
So, (atomic) actions = space-time objects
Action features
Action volume = space-time cuboid around the head, spanning the duration of the action
Encoded with block-histogram features f(θ), θ = (x, y, t, dx, dy, dt, β, φ), defined by
Location (x, y, t)
Space-time extent (dx, dy, dt)
Type of block (β): Plane, Temp-2, Spat-4
Type of histogram (φ):
Histogram of optical flow (HOF)
Histogram of oriented gradient (HOG)
Histogram features
(Simplified) histogram of oriented gradients (HOG):
Apply a gradient operator (e.g., Sobel) to each frame of the sequence
Bin the gradients, discretized into 4 orientations, into a block histogram
Histogram of optical flow (HOF):
Calculate the optical flow (OF) between frames
Bin the OF vectors, discretized into 4 direction bins (+1 bin for no motion), into a block histogram
The normalized action cuboid has size 14x14x8, with units corresponding to 5x5x5 pixels
More than 10^6 possible features f(θ)
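A per-frame version of the 4-orientation gradient histogram can be sketched as follows. Magnitude-weighted voting and the normalization are assumptions, and the paper bins per block within the cuboid rather than per whole frame:

```python
import numpy as np

def hog4(frame):
    """Simplified 4-orientation gradient histogram for one grayscale frame.
    A sketch of the slide's block histogram, computed over the full frame."""
    gy, gx = np.gradient(frame.astype(float))        # gradients along y, x
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    bins = np.minimum((ang / (np.pi / 4)).astype(int), 3)  # 4 orientation bins
    hist = np.zeros(4)
    for b in range(4):
        hist[b] = mag[bins == b].sum()               # magnitude-weighted votes
    return hist / (hist.sum() + 1e-9)                # normalized histogram
```
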
Action learning
Use a boosting method (e.g., AdaBoost) to classify features within an action volume
Features:
Block-histogram features
AdaBoost:
Efficient discriminative classifier [Freund & Schapire ’97]
Good performance for face detection [Viola & Jones ’01]
Action learning: Boosting
A weak classifier h is a classifier with accuracy only slightly better than chance
Boosting: combine a number of weak classifiers so that the ensemble is arbitrarily accurate
Allows the use of simple (weak) classifiers without loss of accuracy
Selects features and trains the classifier
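The boosting loop itself can be sketched as discrete AdaBoost. The weak learner is passed in as a function; the exact reweighting variant used in the paper is not specified here, so this is a generic sketch:

```python
import numpy as np

def adaboost(X, y, weak_learn, rounds=10):
    """Discrete AdaBoost sketch [Freund & Schapire '97]: reweight the samples,
    fit one weak classifier per round, combine them with log-odds weights.
    weak_learn(X, y, w) must return a predict function h with h(X) in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(X, y, w)
        pred = h(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)    # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in ensemble))
```
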
Action learning: Boosting
Weak classifier ht:
For one-dimensional features, select an optimal decision threshold
(e.g., for Haar-filter responses in the Viola & Jones face detector)
Here: m-dimensional features
Project data on one dimension using Fisher’s Linear Discriminant (FLD), then select optimal threshold in 1-D
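The FLD-based weak classifier can be sketched as follows. The small regularization added to the within-class scatter matrix is an assumption for numerical stability, not a detail from the paper:

```python
import numpy as np

def fld_stump(X, y, w):
    """Weak classifier from the slide: project the m-dimensional features onto
    Fisher's Linear Discriminant, then exhaustively pick the best 1-D
    threshold under the sample weights w. Labels y are in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    # within-class scatter, regularized (assumed detail) to stay invertible
    Sw = np.cov(pos.T) + np.cov(neg.T) + 1e-6 * np.eye(X.shape[1])
    direction = np.linalg.solve(Sw, pos.mean(axis=0) - neg.mean(axis=0))
    proj = X @ direction                     # 1-D projection of all samples
    best = None
    for thr in np.unique(proj):              # exhaustive 1-D threshold search
        for sign in (1, -1):
            pred = np.where(sign * (proj - thr) > 0, 1, -1)
            err = w[pred != y].sum()         # weighted classification error
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return direction, best[1], best[2]
```
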
Action classification test
Comparison of:
Static keyframe classifier with spatial HOG features (BH-Grad4)
Use boosting to classify actions using features from keyframes only (the keyframe is the frame at which the hand reaches the mouth)
Space-time action classifier with HOF features (STBH-OF5)
Space-time action classifier with HOF and HOG features (STBH-OFGrad9)
Action classification test
Figure: random motion patterns
Additional shape information does not seem to improve the space-time classifier
Space-time classifier and static key-frame classifier might have complementary properties
Classifier properties
Space-time classifier (HOF) vs. static keyframe classifier (HOG)
Training output: accumulated feature maps
The space-time classifier and the static keyframe classifier might have complementary properties
Classifier properties
Regions of selected features show the most “active” parts of the classifier
High activity at the beginning and end of the sequence, but low activity at the keyframe
Lower accuracy is expected when classifying only keyframes
Space-time classifier selects most features around hand region
Keyframe classifier selects most features in upper part of the key-frame
Idea: Use complementary properties to combine classifiers (-> keyframe priming)
Keyframe priming
Combination of static key-frame classifier with space-time classifier
Motivated by complementary properties of both classifiers
Bootstrap space-time classifier and apply it to keyframes detected by the keyframe detector (boosted space-time window classifier)
Speeds up detection
Combines complementary models
Keyframe priming
Apply the keyframe detector (HOG classification on a single frame) to all positions, scales and frames, tuned to a high false-positive rate
Generate space-time blocks aligned with detected keyframes and with different temporal extent
Run space-time classifier on each hypothesis
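The three steps above can be sketched as a cascade. The `detect`/`score` interfaces of the two classifiers are hypothetical stand-ins for illustration, not the paper's actual API:

```python
def keyframe_primed_detect(video, keyframe_clf, spacetime_clf,
                           kf_threshold, temporal_extents):
    """Priming cascade sketch: a permissive per-frame keyframe detector
    proposes windows; the boosted space-time classifier then scores cuboids
    of several temporal extents around each proposal. All interfaces here
    (detect, score) are illustrative assumptions."""
    detections = []
    for t, frame in enumerate(video):
        # permissive threshold -> deliberately high false-positive rate
        for window in keyframe_clf.detect(frame, threshold=kf_threshold):
            for dt in temporal_extents:
                # space-time block aligned with the keyframe hypothesis
                cuboid = (window, max(0, t - dt // 2), t + dt // 2)
                score = spacetime_clf.score(video, cuboid)
                if score > 0:                 # keep positive responses
                    detections.append((cuboid, score))
    return detections
```
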
Keyframe priming
Figure: positive and negative training samples, and test examples
Keyframe-primed event detection
Figure: keyframe detections and action detections, with and without keyframe priming
Test on 25 minutes of “Coffee and Cigarettes” containing 38 drinking actions
No overlap with the training set in subjects or scenes
Keyframe priming is faster and leads to significantly better results
Learning realistic human actions from movies
Ivan Laptev, Marcin Marszalek, Cordelia Schmid and
Benjamin Rozenfeld
CVPR 2008
Action recognition in real-world videos
Humans can do more than just “drinking” and “smoking”
Robust detection and classification of all kinds of human actions needed
Space-Time Features: Detector
Space-Time Interest Points (STIP):
Space-Time Extension of Harris Operator
Add dimensionality of time to the second moment matrix
Look for maxima in extended Harris corner function H
Detection depends on spatio-temporal scale
Extract features at multiple levels of spatio-temporal scales (dense scale sampling)
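A minimal numpy-only sketch of the space-time Harris response H = det(mu) - k * trace(mu)^3, where mu is the second-moment matrix of spatio-temporal gradients. Simplifications are assumptions: one shared smoothing scale instead of separate spatial/temporal scales, and no dense scale sampling:

```python
import numpy as np

def _smooth(vol, sigma):
    # Separable Gaussian smoothing along all three axes (numpy-only helper).
    r = int(3 * sigma) + 1
    x = np.arange(-r, r + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    for ax in range(3):
        vol = np.apply_along_axis(lambda m: np.convolve(m, g, mode="same"), ax, vol)
    return vol

def stip_response(volume, sigma=1.5, k=0.005):
    """Space-time Harris response over a (t, y, x) video volume.
    Interest points would be local maxima of the returned response."""
    v = _smooth(volume.astype(float), sigma)
    Lt, Ly, Lx = np.gradient(v)                    # gradients along t, y, x
    grads = (Lt, Ly, Lx)
    mu = np.empty(volume.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            # second-moment matrix: smoothed products of first derivatives
            mu[..., i, j] = _smooth(grads[i] * grads[j], 2 * sigma)
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
```

Interest points would then be taken as local maxima of the response volume, repeated over several scales for the dense scale sampling mentioned above.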
Space-Time Features: Descriptor
Histogram of oriented spatial gradients (HOG): 3x3x2 grid, 4 bins per cell
Histogram of optical flow (HOF): 3x3x2 grid, 5 bins per cell
Multi-scale space-time patches from the corner detector
Public code available at www.irisa.fr/vista/actions
Space-Time Features: Descriptor
Compute histogram descriptors of space-time volumes in neighborhood of detected points:
Compute a 4-bin HOG for each cube in 3x3x2 space-time grid
Compute a 5-bin HOF for each cube in 3x3x2 space-time grid
Size of each volume related to detection scales
Spatio-temporal bag of features (BoF)
Cluster the features (e.g., k-means) into a visual vocabulary (here, k = 4000)
Assign each feature to the nearest vocabulary word
Compute a histogram of visual-word occurrences over a space-time volume
Different spatio-temporal grids are explored
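The vocabulary and histogram steps can be sketched with plain k-means; the initialization scheme and fixed iteration count are simplifications, not details from the paper:

```python
import numpy as np

def build_vocabulary(descriptors, k=4000, iters=10, seed=0):
    """Plain k-means sketch for the visual vocabulary (the slides use k = 4000)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = descriptors[assign == c]
            if len(members):                  # leave empty clusters in place
                centers[c] = members.mean(axis=0)
    return centers

def bof_histogram(descriptors, centers):
    """Histogram of visual-word occurrences over one space-time volume."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```
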
Action classification
Use SVMs with a multi-channel chi-square kernel: K(H_i, H_j) = exp(-Σ_c (1/A_c) D_c(H_i, H_j))
A “channel” c is a combination of a spatio-temporal grid and a descriptor (HOG or HOF)
D_c is the chi-square distance between the BoF histograms
A_c is the mean value of the distances between all training samples
One-against-all approach for multi-class classification
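A sketch of the kernel evaluation. Representing the channels as dicts keyed by channel name is an illustrative choice, not the paper's data layout:

```python
import numpy as np

def chi2_distance(h1, h2):
    """Chi-square distance between two BoF histograms (numpy arrays)."""
    denom = h1 + h2
    mask = denom > 0                         # skip bins that are empty in both
    return 0.5 * float(np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask]))

def multichannel_kernel(hists_i, hists_j, A):
    """K(H_i, H_j) = exp(-sum_c D_c(H_i, H_j) / A_c); hists_* and A are dicts
    keyed by channel (a grid/descriptor combination)."""
    total = sum(chi2_distance(hists_i[c], hists_j[c]) / A[c] for c in hists_i)
    return float(np.exp(-total))
```
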
Results on KTH actions dataset
Examples of all six classes and all four scenarios
Results on KTH actions dataset
Average class accuracy
Confusion matrix
Results on HOHA dataset
Average Precision for each action class
Comparison of results for annotated (clean) and automatic training data
Chance denotes the results of a random classifier
Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment
Dong Xu and Shih-Fu Chang
CVPR 2007
Event recognition in news broadcast
Challenging task:
Complex motion, cluttered backgrounds, occlusions, geometric variations of objects
HOF/HOG approaches are sensitive only to high-motion regions
News events may have relatively low motion (e.g., fire)
Broadcast news is a less constrained domain than human actions
Event usually consists of different sub-clips
E.g. “riot” may consist of scenes of fire and smoke at different locations
Temporally aligned pyramid matching
Algorithm overview:
Each frame is represented by one feature vector
Decompose the video into sub-clips using hierarchical clustering
On each hierarchy level, calculate the Earth Mover's Distance (EMD) between frame features and use it within an SVM framework
Constrain the EMD so that frames from one sub-clip can only be matched to frames from one other sub-clip (alignment)
Fuse the SVM outputs from all levels
Decomposition of video into sub-clips
Temporal-constrained Hierarchical Agglomerative Clustering (T-HAC):
Initially, each feature vector forms its own cluster
Clusters are built iteratively by merging existing clusters based on their distances (EMD)
Only clusters that are temporal neighbors may be merged
The clusters form a pyramid-like structure (dendrogram)
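T-HAC can be sketched as follows. `dist` scores two clusters of frame indices (EMD in the paper; any callable here), and the recorded partitions stand in for the pyramid levels:

```python
def t_hac(n_frames, dist, record_levels):
    """Temporally-constrained agglomerative clustering sketch: start with one
    cluster per frame, repeatedly merge the closest *adjacent* pair, and
    record the partition whenever the cluster count hits a requested level.
    The single-frame-per-leaf setup and the recording scheme are assumptions."""
    clusters = [[i] for i in range(n_frames)]
    pyramid = {}
    while len(clusters) > 1:
        if len(clusters) in record_levels:
            pyramid[len(clusters)] = [list(c) for c in clusters]
        # only temporally neighboring clusters may merge
        i = min(range(len(clusters) - 1),
                key=lambda j: dist(clusters[j], clusters[j + 1]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    pyramid[1] = [clusters[0]]               # the root level is always kept
    return pyramid
```
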
Features
Low-level global features:
Grid Color Moment (GCM)
First three moments for each grid region (e.g., a 5x5 grid)
Gabor Texture feature (GT)
Edge Direction Histogram (EDH), same idea as HOG:
Apply a Sobel operator and histogram the edge directions, quantized in 5-degree bins
Mid-level feature: Concept Score (CS)
N-dimensional vector (e.g., N = 108)
Each component is the confidence score of a semantic concept classifier (SVM)
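The Grid Color Moment feature can be sketched directly. Taking the third moment as the cube root of the third central moment is a common convention and an assumption here:

```python
import numpy as np

def grid_color_moments(image, grid=(5, 5)):
    """GCM sketch: first three moments (mean, std, cube root of the third
    central moment) of each channel over each cell of the grid, concatenated."""
    h, w, c = image.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = image[gy * h // grid[0]:(gy + 1) * h // grid[0],
                         gx * w // grid[1]:(gx + 1) * w // grid[1]]
            for ch in range(c):
                vals = cell[..., ch].astype(float).ravel()
                mean = vals.mean()
                std = vals.std()
                skew = np.cbrt(((vals - mean) ** 3).mean())
                feats.extend([mean, std, skew])
    return np.array(feats)                   # 5*5 cells * 3 channels * 3 moments
```
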
Temporal alignment
Figure: matching in single-level EMD (a) and temporally aligned pyramid matching (TAPM) (b)
Classification
SVM classification with an Earth Mover's Distance (EMD) kernel: K(P, Q) = exp(-D(P, Q)/A)
D(P, Q): EMD between the features of sequences P and Q
A: hyper-parameter, set empirically through cross-validation
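A sketch of the kernel, together with a 1-D special case of the EMD: for equal-size, uniformly weighted samples, optimal transport in 1-D simply matches points in sorted order. The general EMD requires solving a transportation problem and is not reproduced here:

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two equal-size 1-D samples with uniform weights:
    the optimal matching pairs the sorted values."""
    return float(np.mean(np.abs(np.sort(p) - np.sort(q))))

def emd_kernel(d, A):
    """Kernel from the slide: K(P, Q) = exp(-D(P, Q) / A)."""
    return float(np.exp(-d / A))
```
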
Fusion
Fuse the SVM outputs from the different levels directly: weighted sum Σ_l h_l g_l
L: number of sub-clip levels
g_l: decision values from the SVMs at the different levels l
h_l: level weights (e.g., all weights equal to 1)
Similar to logistic regression / a perceptron
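The fusion rule is a plain weighted sum, as sketched here:

```python
def fuse_levels(decision_values, level_weights):
    """Direct fusion from the slide: weighted sum of the per-level SVM
    decision values g_l with level weights h_l."""
    return sum(h * g for g, h in zip(decision_values, level_weights))
```
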
Experiments
Test with LSCOM concepts on TRECVID-2005
56 events/activities annotated (449 total concepts)
10 events chosen
Relatively high occurrence frequency
Intuitively recognizable from visual cues
Number of positive samples for each class: 54-877
Results
Comparison of single-level EMD (SLEMD) with key-frame classification on different features
Results
Average Precision on different levels of TAPM
L<x>: classification at level x
d: fusion with non-uniform weights (h_0 = h_1 = 1, h_2 = 2)
Summary
Paper 1: “Retrieving actions in movies”
Atomic human actions: smoking, drinking
Histograms of oriented gradients (HOG) and histograms of optical flow (HOF) at different spatio-temporal positions and scales, with different grid types
Boosting to select good features and build the classifier
Paper 2: “Learning realistic human actions from movies”
Eight atomic human actions: Kiss, AnswerPhone, GetOutCar, ..
HoG / HoF at space-time interest points, bag of features
SVM for classification
Paper 3: “Visual event recognition in news videos…”
Decomposes video into sub-clips
Temporal alignment improves classification
Low- and mid-level features, SVM for classification
References
I. Laptev and P. Perez. Retrieving actions in movies. ICCV 2007.
I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. CVPR 2008.
D. Xu and S.-F. Chang. Visual event recognition in news video using kernel methods with multi-level temporal alignment. CVPR 2007.
References
http://www.irisa.fr/vista/Equipe/People/Ivan.Laptev.html
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR '01
R. Schapire, Y. Freund, P. Bartlett and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. ICML 1997.
Student assistant (HiWi) position
Interested in action recognition?
We are currently looking for a motivated student to teach ARMAR-III to recognize situations within rooms (e.g., cooking, cleaning)
Required skills:
Interest in Computer Vision
C++ programming experience under Linux
Email [email protected] for more details