![Page 1: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/1.jpg)
Xiaodong Yang, Pavlo Molchanov, Jan Kautz
Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
![Page 2: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/2.jpg)
INTELLIGENT VIDEO ANALYTICS
Surveillance event detection
Human-computer interaction
Multimedia search and indexing
@bmw.com
Video Classification
![Page 6: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/6.jpg)
INTELLIGENT VIDEO ANALYTICS: Related Work
Local feature extraction:
• Dense trajectories, H. Wang et al., ICCV 2013
Global feature representation:
• Bag-of-visual-words, J. Gemert et al., TPAMI 2009
• Fisher vector, F. Perronnin et al., ECCV 2010
Temporal modeling:
• Spatio-temporal pyramid, X. Yang et al., ECCV 2014
![Page 7: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/7.jpg)
INTELLIGENT VIDEO ANALYTICS: Related Work
• 2D-CNN, A. Karpathy et al., CVPR 2014
• C3D, D. Tran et al., ICCV 2015
• Two-stream networks, K. Simonyan et al., NIPS 2014
• LSTM, J. Ng, CVPR 2015
![Page 8: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/8.jpg)
OUR CONTRIBUTIONS
Overview of multilayer and multimodal fusion for video classification
Local feature extraction:
• Multilayer representations from CNN
Global feature representation:
• Multimodal representations
• Fusion by boosting
Temporal modeling:
• Structure of FC-RNN
![Page 9: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/9.jpg)
MULTILAYER REPRESENTATIONS
Dense image prediction
FCN by Long et al. FlowNet by Fischer et al.
![Page 10: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/10.jpg)
MULTILAYER REPRESENTATIONS
Features of conv layers
Poses, parts, articulations, objects, etc.
Visualization by Zeiler et al.
![Page 11: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/11.jpg)
MULTILAYER REPRESENTATIONS
Convert feature maps to feature descriptors
Feature maps of dimension 28×28×5
28×28 feature descriptors of dimension 5
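The conversion above can be sketched in a few lines of NumPy. This is only an illustration: the function name `maps_to_descriptors` and the channels-last layout `(H, W, C)` are assumptions, not something the slides specify.

```python
import numpy as np

def maps_to_descriptors(feature_maps: np.ndarray) -> np.ndarray:
    """Turn an (H, W, C) stack of feature maps into H*W descriptors of size C."""
    h, w, c = feature_maps.shape
    # Each spatial location (i, j) contributes one C-dimensional descriptor,
    # stored at row i * W + j of the result.
    return feature_maps.reshape(h * w, c)

# Toy input with the dimensions quoted on the slide: 28x28 locations, 5 channels.
feature_maps = np.arange(28 * 28 * 5, dtype=np.float32).reshape(28, 28, 5)
descriptors = maps_to_descriptors(feature_maps)
print(descriptors.shape)  # (784, 5)
```

If a framework stores activations channels-first, as `(C, H, W)`, a transpose to `(H, W, C)` would be needed before the reshape.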
![Page 12: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/12.jpg)
MULTILAYER REPRESENTATIONS
Learn spatial discriminative weights of conv layers
Spatial information of conv layers to enhance representations
Figure: video frames; feature maps of a conv layer over time; spatial weights of a conv layer (importance)
![Page 13: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/13.jpg)
MULTILAYER REPRESENTATIONS
Aggregate feature descriptors by Fisher vector (FV)
Figure: Gaussian mixture model; feature maps of a conv layer over time
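As a rough illustration of Fisher vector aggregation, the sketch below encodes a set of descriptors against a diagonal-covariance GMM using the standard mean and variance gradients, followed by the commonly used power and L2 normalization. The GMM parameters here are random placeholders (in practice they would be fitted to training descriptors), and all function and variable names are assumptions made for this example.

```python
import numpy as np

def gmm_posteriors(x, prior, mu, sigma):
    """Soft assignments gamma[i, k] of N descriptors to K diagonal Gaussians."""
    z = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]       # (N, K, D)
    # Log posterior up to a constant that cancels in the normalization below.
    log_post = (np.log(prior)[None, :]
                - 0.5 * (z ** 2).sum(axis=-1)
                - np.log(sigma).sum(axis=-1)[None, :])              # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)                 # numerical stability
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=1, keepdims=True), z

def fisher_vector(x, prior, mu, sigma):
    """Encode (N, D) descriptors as a 2*K*D Fisher vector."""
    n = x.shape[0]
    gamma, z = gmm_posteriors(x, prior, mu, sigma)
    g = gamma[:, :, None]                                           # (N, K, 1)
    g_mu = (g * z).sum(axis=0) / (n * np.sqrt(prior)[:, None])
    g_sigma = (g * (z ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * prior)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                          # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                        # L2 normalization

rng = np.random.default_rng(0)
x = rng.normal(size=(784, 5))          # e.g. 28x28 descriptors of dimension 5
prior = np.full(3, 1.0 / 3.0)          # placeholder GMM with K = 3 components
mu = rng.normal(size=(3, 5))
sigma = np.ones((3, 5))
fv = fisher_vector(x, prior, mu, sigma)
print(fv.shape)  # (30,)
```

For a video, the descriptors from the conv-layer feature maps of all frames would be stacked into one `(N, D)` array before encoding.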
![Page 14: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/14.jpg)
MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV)
Figure: Gaussian mixture model; feature maps of a conv layer over time; spatial weights of a conv layer (importance)
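The slides do not show the iFV formula, so the following is only one plausible reading: each descriptor's contribution to the Fisher vector gradients is scaled by the spatial weight of its location. The function name `weighted_fisher_vector`, the normalization by the sum of weights, and the random placeholder weights and GMM parameters are all assumptions made for this sketch.

```python
import numpy as np

def weighted_fisher_vector(x, w, prior, mu, sigma):
    """Fisher vector in which descriptor i contributes in proportion to w[i].

    x: (N, D) descriptors, w: (N,) non-negative weights,
    prior: (K,), mu and sigma: (K, D) parameters of a diagonal GMM.
    """
    z = (x[:, None, :] - mu[None]) / sigma[None]                    # (N, K, D)
    log_post = np.log(prior) - 0.5 * (z ** 2).sum(-1) - np.log(sigma).sum(-1)
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                       # (N, K)
    gw = (gamma * w[:, None])[:, :, None]                           # weighted posteriors
    total = w.sum()                                                 # assumed normalizer
    g_mu = (gw * z).sum(0) / (total * np.sqrt(prior)[:, None])
    g_sigma = (gw * (z ** 2 - 1)).sum(0) / (total * np.sqrt(2 * prior)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

rng = np.random.default_rng(0)
T, H, W, D, K = 4, 28, 28, 5, 3
x = rng.normal(size=(T * H * W, D))        # descriptors, frame by frame
spatial_w = rng.random((H, W))             # placeholder for learned spatial weights
w = np.tile(spatial_w.ravel(), T)          # same spatial weights for every frame
prior = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
sigma = np.ones((K, D))
ifv = weighted_fisher_vector(x, w, prior, mu, sigma)
print(ifv.shape)  # (30,)
```

With all weights equal, this reduces to the unweighted Fisher vector gradients, which is a convenient sanity check for the sketch.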
![Page 15: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/15.jpg)
MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV)
Represent fc layers by temporal max pooling
Overview of multilayer representation
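Temporal max pooling of fc-layer activations is simple enough to show directly. In this minimal sketch, the array shape `(T, D)` (one row per frame or clip) and the name `temporal_max_pool` are assumptions for illustration.

```python
import numpy as np

def temporal_max_pool(fc_activations: np.ndarray) -> np.ndarray:
    """Collapse (T, D) per-frame fc activations into one (D,) video-level vector."""
    # For every feature dimension, keep the strongest response over time.
    return fc_activations.max(axis=0)

rng = np.random.default_rng(0)
fc_per_frame = rng.normal(size=(16, 4096))   # e.g. 16 frames, 4096-d fc activations
video_vec = temporal_max_pool(fc_per_frame)
print(video_vec.shape)  # (4096,)
```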
![Page 20: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/20.jpg)
FC-RNN STRUCTURE: Modeling Temporal Dynamics
Don’t be a hero: use pre-trained models
Many pre-trained models from ImageNet and Sports1M
Figure: Images/snippets: VGG/C3D. Videos: standard RNN (VGG/C3D, fc layer, RNN) and FC-RNN (VGG/C3D, FC-RNN).
![Page 21: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/21.jpg)
FC-RNN STRUCTURE: Modeling Temporal Dynamics
RNN
FC-RNN
Pre-trained CNN, fc layer:
Transfer to recurrent layers
Comparison of standard RNN and FC-RNN
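The equations on this slide did not survive extraction, so the sketch below is an assumed reading of "transfer to recurrent layers": the pre-trained fc weights are reused as the input-to-hidden weights of a recurrent layer, and only the hidden-to-hidden matrix is new. The ReLU activation, the zero initial state, and all names (`fc_layer`, `fc_rnn`, `w_fc`, `b_fc`, `w_hh`) are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def fc_layer(x, w_fc, b_fc):
    """Pre-trained fully connected layer applied to one input vector."""
    return relu(w_fc @ x + b_fc)

def fc_rnn(x_seq, w_fc, b_fc, w_hh):
    """Recurrent layer that reuses the fc weights as input-to-hidden weights.

    Only w_hh (hidden-to-hidden) is new and has to be trained from scratch.
    """
    y = np.zeros(w_fc.shape[0])            # assumed zero initial state
    outputs = []
    for x in x_seq:
        # Same affine map as the fc layer, plus a recurrent term.
        y = relu(w_fc @ x + w_hh @ y + b_fc)
        outputs.append(y)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d_in, d_out = 6, 8, 4
x_seq = rng.normal(size=(T, d_in))         # inputs to the fc layer, one per time step
w_fc = rng.normal(size=(d_out, d_in))      # stands in for pre-trained fc weights
b_fc = rng.normal(size=d_out)
w_hh = 0.1 * rng.normal(size=(d_out, d_out))
y_seq = fc_rnn(x_seq, w_fc, b_fc, w_hh)
print(y_seq.shape)  # (6, 4)
```

Under this reading, setting `w_hh` to zero reproduces the pre-trained fc layer frame by frame, so training starts from the pre-trained behavior. A standard RNN stacked on top of the fc layer would instead introduce its own input-to-hidden matrix, with all recurrent parameters trained from scratch.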
![Page 22: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/22.jpg)
MULTIMODAL REPRESENTATIONS
Static and dynamic information
2D-CNN/3D-CNN with video frames/optical flow maps
A single frame
A single flow map
A buffer of frames
A buffer of flow maps
![Page 24: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/24.jpg)
FUSION BY BOOSTING
Optimize a linear combination of predictions of multiple layers from multiple modalities
LPBoost:
boost-u: learn uniform weights for all classes
boost-c: learn class specific weights
4 layers and 4 modalities: M = 4 × 4 = 16 predictions to combine
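To make the difference between the two weighting schemes concrete, the sketch below applies given weights to M = 16 score matrices. It does not solve the LPBoost optimization itself: the weights are random placeholders, and the array shapes and function names are assumptions chosen for illustration.

```python
import numpy as np

def fuse_uniform(scores, weights):
    """boost-u style: one weight per predictor, shared by all classes.

    scores: (M, N, C) class scores from M predictors for N videos.
    weights: (M,). Returns fused scores of shape (N, C).
    """
    return np.tensordot(weights, scores, axes=1)

def fuse_class_specific(scores, weights):
    """boost-c style: a separate weight per predictor and class.

    weights: (M, C). Returns fused scores of shape (N, C).
    """
    return (weights[:, None, :] * scores).sum(axis=0)

rng = np.random.default_rng(0)
M, N, C = 16, 10, 101                       # 16 predictors, 10 videos, 101 classes
scores = rng.random((M, N, C))
w_u = rng.random(M)
w_u /= w_u.sum()                            # placeholder weights on the simplex
w_c = rng.random((M, C))
w_c /= w_c.sum(axis=0, keepdims=True)       # one simplex per class

pred_u = fuse_uniform(scores, w_u).argmax(axis=1)
pred_c = fuse_class_specific(scores, w_c).argmax(axis=1)
print(pred_u.shape, pred_c.shape)  # (10,) (10,)
```

When every class uses the same column of weights, the class-specific combination reduces to the uniform one, so boost-u can be viewed as a constrained special case of boost-c.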
![Page 25: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/25.jpg)
EXPERIMENTS
Benchmark datasets
UCF101: 13,320 videos in 101 classes
HMDB51: 6,766 videos in 51 classes
Skiing
Kissing
![Page 27: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/27.jpg)
EXPERIMENTS: FC-RNN
Outperforms RNN and LSTM by 3.0% and 2.9%
Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (plot of error rate vs. epochs)
Up to 3% improvement
![Page 29: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/29.jpg)
EXPERIMENTS: Feature Aggregation
Comparison of FV and iFV to represent conv layers of different modalities
Figure: spatial weights of a conv layer (importance) for a single frame, a single flow map, a buffer of frames, and a buffer of flow maps
Up to 2.5% improvement
![Page 33: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/33.jpg)
EXPERIMENTS: Multilayer Fusion
Classification accuracy of single layers over different modalities and multilayer fusion results
Up to 8% improvement
![Page 34: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/34.jpg)
EXPERIMENTS: Multimodal Fusion
Classification accuracy of different modalities and various combinations
Comparison to the state-of-the-art results
Up to 6% improvement
![Page 35: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/35.jpg)
EXPERIMENTS: LPBoost
Figure: two pie charts, Modalities and Layers (conv4, conv5, fc6, fc7). Percentages shown: 17%, 31%, 23%, 29% and 0%, 38%, 12%, 50%.
![Page 36: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/36.jpg)
EXPERIMENTS: Effect of Multimodal Fusion
SKIING vs. SKIJET
2D-CNN-SF: skijet :(
Multimodal Fusion: skiing :)
![Page 37: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/37.jpg)
EXPERIMENTS: Effect of Multimodal Fusion
BOXING PUNCHING BAG vs. BOXING SPEEDING BAG
2D-CNN-OF: boxing speeding bag :(
Multimodal Fusion: boxing punching bag :)
![Page 38: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/38.jpg)
OUR CONTRIBUTIONS
Local feature extraction:
• Multilayer representations from CNN
Global feature representation:
• Multimodal representations
• Fusion by boosting
Temporal modeling:
• Structure of FC-RNN
Overview of multilayer and multimodal fusion for video classification
![Page 39: Multilayer and Multimodal Fusion of Deep Neural …on-demand.gputechconf.com/gtc/2017/presentation/s7497...Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification](https://reader035.vdocuments.mx/reader035/viewer/2022062311/5e83a0689a7930331258d63d/html5/thumbnails/39.jpg)