TRANSCRIPT
Action recognition in videos
Cordelia Schmid
INRIA Grenoble
Action recognition - problem
• Short actions, e.g. drinking, sitting down
(Coffee & Cigarettes dataset, Hollywood dataset)
• Activities/events, e.g. making a sandwich, depositing a suspicious object
(TRECVID Multimedia Event Detection)
TRECVID - Multimedia Event Detection
Attempting a board trick | Feeding an animal | Wedding ceremony | Getting a vehicle unstuck
Action recognition
• Action recognition is person-centric
• Vision is person-centric: we mostly care about things which are important
[Figure: statistics for Movies, TV, YouTube — 40%, 35%, 34%. Source: I. Laptev]
Action recognition from still images
• Description of the human pose
– Silhouette description [Sullivan & Carlsson, 2002]
– Histogram of gradients (HOG) [Dalal & Triggs, 2005]
– Human body part layout [Felzenszwalb & Huttenlocher, 2000]
Action recognition from still images
• Supervised modeling interaction between human & object [Gupta et al. 2009, Yao & Fei-Fei 2009]
• Weakly-supervised learning of objects [Prest, Schmid & Ferrari 2011]
Results on the PASCAL VOC 2010 human action classification dataset
Importance of action objects
• Human pose often not sufficient by itself
• Objects define the actions
Importance of temporal information
• Video/temporal information necessary to disambiguate actions
• Temporal context describes the action/activity
• Key frames provide significantly less information
Action recognition in videos
• Temporal information makes it possible to stabilize human and object detection by tracking – J. Malik: tracking by detection is difficult?
• Large amount of data, growing very fast – H. Sawhney: large amounts of data, not often well explored
• Often comes with some form of supervision (scripts, subtitles) – similar in spirit to M. Hebert's comment on the large amount of data collected by a robot
Action recognition in videos
Motion history image [Bobick & Davis, 2001]
Spatial motion descriptor [Efros et al., ICCV 2003]
Learning dynamic prior [Blake et al., 1998]
Sign language recognition [Zisserman et al., 2009]
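The motion history image cited above is a simple, concrete representation: pixels where motion was just detected are stamped with a duration value, and older motion decays away. A minimal numpy sketch of this idea, with frame differencing as the motion detector (the window `tau` and the `threshold` are illustrative choices, not values from Bobick & Davis):

```python
import numpy as np

# Toy frame sequence: a bright square moving right (values in [0, 1]).
T, H, W = 5, 8, 8
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 3:5, t:t + 2] = 1.0

tau = 4          # temporal window in frames (assumption)
threshold = 0.1  # motion threshold for frame differencing (assumption)

mhi = np.zeros((H, W))
prev = frames[0]
for t in range(1, T):
    motion = np.abs(frames[t] - prev) > threshold
    # recent motion is stamped with tau; older motion decays by 1 per frame
    mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))
    prev = frames[t]

print(mhi.max())  # 4.0 — the most recent motion carries the largest value
```

The resulting single image encodes where motion occurred and, through the intensity ramp, in what order, which is what makes it usable as a template for short-action recognition.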
Action recognition in videos
• Bag of space-time features [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
[Pipeline: collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → SVM classifier]
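The quantization step at the heart of this pipeline can be sketched in a few lines: descriptors are clustered into a visual vocabulary, then each video is represented by a normalized histogram of word assignments (the histogram would then feed an SVM). The descriptor dimensions, vocabulary size, and the random data below are all illustrative stand-ins for real HOG/HOF features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for HOG/HOF descriptors extracted at the space-time
# interest points of one video (sizes are illustrative).
descriptors = rng.normal(size=(200, 8))  # 200 patches, 8-D descriptors

def build_vocabulary(desc, k=5, iters=10):
    """Minimal k-means to build a visual vocabulary of k words."""
    centers = desc[rng.choice(len(desc), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        dist = np.linalg.norm(desc[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = desc[labels == j].mean(axis=0)
    return centers

def bow_histogram(desc, centers):
    """Quantize descriptors and build a normalized word histogram."""
    dist = np.linalg.norm(desc[:, None] - centers[None], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

vocab = build_vocabulary(descriptors)
hist = bow_histogram(descriptors, vocab)
print(hist.shape)  # (5,)
```

In practice the vocabulary is learned once over training videos, and each video's histogram becomes its fixed-length feature vector for the SVM.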
Action recognition in videos
• Bag of space-time features
– Many recent extensions: new features / tracklets, temporal structuring, etc.
– Advantages
• Very useful as a baseline
• Captures spatial and temporal context – see Efros' comment on image classification
– Disadvantages
• No interpretation of the action
• Not sufficient for localization & description
Action recognition in videos
• Localization by 3D HOG/HOF, interaction with objects
[Pipeline: tracking by detection → space-time description → interaction with objects]
Action recognition in videos
• HOG 3D tracks + description
– Detection & tracking of humans with part-based models works well
• move towards more flexible models, similar to P. Felzenszwalb's
• integrate motion information
• very good baseline
– Towards more flexible descriptions based on body parts; a lot of recent work on finding human body parts
– Interaction with objects is important, but hard
Discussion
• Need for more challenging datasets
– Need for realistic datasets
– Scale up the number of classes (today ~10 actions per dataset)
– Increase the number of examples per class, possibly with weakly supervised learning (the number of examples per video is low)
– Define a taxonomy; use redundancy between action classes to improve training
– Manual exhaustive labeling of all actions is impossible
(KTH dataset, Hollywood dataset)
Discussion
• Make better use of the large amount of information inherent in videos
– automatic collection of additional examples
– improve models incrementally
– use weak labels from associated data (text, sound, subtitles)
• Many existing techniques are straightforward extensions of methods for images
– almost no use of 3D information
– learn better interaction and temporal models
– design activity models by decomposition into simple actions