Two Stream Self-Supervised Learning for Action Recognition
Ahmed Taha, Moustafa Meshry, Xitong Yang, Yi-Ting Chen, Larry Davis
Goals
• Entangle spatial and temporal representations for action recognition.
• Boost performance on small and imbalanced video datasets:
  • HMDB51 and UCF101 have roughly 130 samples per class.
  • The Honda Driving Dataset is imbalanced.
Approach
1. Sequence verification [Shuffle & Learn, O3N, OPN]
2. Spatial-temporal verification (tuple construction is sketched below)
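The following is a minimal sketch of pretext-label construction, assuming the four verification classes (Class I–IV in the architecture figure) arise from the cross-product of frame-order correctness and RGB/motion correspondence; the clip length, sampling scheme, and class numbering are assumptions, not the authors' stated recipe.

```python
import random

def make_tuple(video_frames, other_video_frames):
    """Return (rgb_frame, motion_clip, class_label) for one training tuple.

    Hypothetical labeling: Class I..IV map to the cross-product of
    {ordered, shuffled} x {corresponding, mismatched}. The motion clip
    would be converted to a motion encoding (e.g., SOD) downstream.
    Assumes the video has at least 7 frames.
    """
    idx = random.randrange(len(video_frames) - 6)
    clip = video_frames[idx:idx + 6]          # temporally ordered clip
    shuffled = random.random() < 0.5          # corrupt temporal order?
    mismatched = random.random() < 0.5        # break spatial correspondence?

    motion_clip = list(clip)
    if shuffled:
        random.shuffle(motion_clip)

    rgb = clip[len(clip) // 2]                # center frame as spatial input
    if mismatched:                            # RGB drawn from another video
        rgb = random.choice(other_video_frames)

    label = 2 * int(shuffled) + int(mismatched)   # Class I..IV -> 0..3
    return rgb, motion_clip, label
```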
Architecture
Given a tuple of an RGB frame and a stack of differences (SOD), the network reasons about frame ordering and spatio-temporal correspondence.
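A minimal PyTorch sketch of the two-tower design follows. The ResNet-18 backbones, the 5-channel SOD input, and concatenation fusion are assumptions; the poster specifies only a Spatial Tower, a Motion Tower, and a Fusion + FCN stage.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamVerifier(nn.Module):
    def __init__(self, sod_channels=5, num_classes=4):
        super().__init__()
        self.spatial = models.resnet18(weights=None)   # Spatial Tower (RGB)
        self.spatial.fc = nn.Identity()
        self.motion = models.resnet18(weights=None)    # Motion Tower (SOD)
        # SOD input has `sod_channels` frame-difference channels, not 3
        self.motion.conv1 = nn.Conv2d(sod_channels, 64, kernel_size=7,
                                      stride=2, padding=3, bias=False)
        self.motion.fc = nn.Identity()
        self.head = nn.Sequential(                     # Fusion + FC classifier
            nn.Linear(512 * 2, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, rgb, sod):
        fused = torch.cat([self.spatial(rgb), self.motion(sod)], dim=1)
        return self.head(fused)
```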
Our approach supports various motion encodings, such as stack of differences, dynamic images, or optical flow.
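As one concrete encoding, a stack of differences can be computed by differencing consecutive frames and stacking the results along the channel axis; working in grayscale is an assumption here.

```python
import numpy as np

def stack_of_differences(frames):
    """frames: (T, H, W) grayscale array -> (T-1, H, W) SOD volume."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]   # consecutive frame differences
```

A 6-frame clip then yields a 5-channel SOD volume, matching the sod_channels=5 assumption in the model sketch above.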
Experiments: Quantitative Evaluation on HMDB51, UCF101, and Honda.
Temporal representation learning can benefit from ImageNet spatial representations (see the transfer sketch below).
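One common way to realize this transfer is to warm-start the motion tower from ImageNet weights, averaging the pretrained RGB filters of the first convolution and replicating them across the SOD channels. This is a sketch of that idea, not necessarily the authors' recipe.

```python
import torch
import torchvision.models as models

def imagenet_init_motion_tower(sod_channels=5):
    """Adapt an ImageNet-pretrained ResNet-18 to an SOD input."""
    pretrained = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    w = pretrained.conv1.weight.data              # (64, 3, 7, 7)
    mean_w = w.mean(dim=1, keepdim=True)          # collapse RGB -> 1 channel
    pretrained.conv1 = torch.nn.Conv2d(sod_channels, 64, kernel_size=7,
                                       stride=2, padding=3, bias=False)
    # Replicate the averaged filter across SOD channels (an assumption)
    pretrained.conv1.weight.data = mean_w.repeat(1, sod_channels, 1, 1)
    return pretrained
```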
[Architecture figure: Spatial Tower (RGB) and Motion Tower (SOD) with Fusion and FCN; tuples are classified into four classes, Class I–IV.]
Supervised (Sup) vs. Self-Supervised + Supervised (Self-Sup):

Dataset          Sup      Self-Sup
UCF101/Split#1   51.53    60.45
UCF101/Split#2   51.73    58.23
UCF101/Split#3   50.05    55.60
HMDB51/Split#1   31.47    33.54
HMDB51/Split#2   32.91    36.60
Honda            81.72    82.78
Learned Temporal Visualization
Early Overfitting
Temporal filter visualization is uninformative, so input motion reconstruction is used instead for qualitative evaluation. Self-supervised reconstructions are less noisy than reconstructions from supervised learned features (see the inversion sketch below).
[Figure: input motion reconstructions for synthetic and real motion, comparing supervised vs. self-supervised features.]
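One way such reconstructions can be produced is gradient-based feature inversion: optimize a random input so the motion tower's features match those of a real SOD volume. The recipe below (MSE feature matching, Adam, no image prior) is an assumption; the poster only states that reconstructions are compared.

```python
import torch

def reconstruct_motion(motion_tower, target_sod, steps=200, lr=0.05):
    """Invert a real SOD input through the motion tower for inspection."""
    motion_tower.eval()
    with torch.no_grad():
        target_feat = motion_tower(target_sod)    # features to match
    x = torch.randn_like(target_sod, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(motion_tower(x), target_feat)
        loss.backward()
        opt.step()
    return x.detach()                             # reconstructed motion input
```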
The lack of large annotated datasets can limit supervised learning techniques in many domains, such as medicine and autonomous driving. Massive amounts of data exist for these applications, but annotating them can be costly for various reasons: it can be time-consuming, as in semantic segmentation, or expensive due to the need for medical expertise.

In this work, we investigate self-supervised learning for videos. Specifically, we focus on training deep CNNs for the task of action recognition.