Image & Pervasive Access Lab CNRS UMI 2955 - Singapore www.ipal.cnrs.fr
Deep Learning for Image Understanding
Olivier Morère (1), Julie Petta (2), Jie Lin (3), Vijay Chandrasekhar (3), Antoine Veillard (1)
(1) Université Pierre et Marie Curie, (2) Supélec, (3) A-Star Institute for Infocomm Research
Team Web & Data Science
Image Classification
Video Summarization
Compact Image Representations for Image Similarity Search
Convolutional Neural Networks
[Figure: DeepHash pipeline]
• Global Feature Extraction: Input Image → Deep Convolutional Neural Network (4K dim.) or Fisher Vector (8K-64K dim.) → High-dimensional Image Descriptor
• Training Phase 1: Unsupervised. Stacked Regularized RBMs with weights W1, W2, ..., WL
• Training Phase 2: Fine-Tuning. Deep Siamese Network with weights W1, W2, ..., WL, trained on matching & non-matching pairs with losses Loss1, Loss2, ..., LossL
• Transfer model from Training to Testing
• Hashing (Testing): Image Descriptor → Trained DeepHash Model (W1, W2, ..., WL) → Compact Binary Hash, 64-1K bits
[Figure: video summaries for different values of α. Panel labels: α = 1, 0 < α < 1, α = 0. Axis labels: More subject-centric, More scene-centric]
GoogLeNet [Szegedy et al., 2014]
Oxford VGG [Simonyan & Zisserman, 2014]
[Figure: Oxford VGG architecture. Input Image → Conv-64 → Conv-64 → MaxPool → Conv-128 → Conv-128 → MaxPool → Conv-256 → Conv-256 → MaxPool → Conv-512 → Conv-512 → MaxPool → Conv-512 → Conv-512 → MaxPool → FC-4096 → FC-4096 → FC-1000 → Softmax → Softmax Loss]
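The 16-layer, 138M-parameter figure quoted on the poster for Oxford VGG can be checked with a little arithmetic. The sketch below is not taken from the poster. It assumes the 16-layer configuration (configuration D) of Simonyan & Zisserman, with 3x3 convolutions, a 224x224 RGB input and biases counted, which is more detailed than the abbreviated diagram. The names `CONV_STAGES`, `FC_SIZES` and `count_vgg16_parameters` are illustrative.

```python
# Assumed layout (VGG configuration D): each stage is
# (output channels, number of 3x3 convolution layers), followed by a 2x2 max-pool.
CONV_STAGES = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
# Assumed fully connected head: FC-4096, FC-4096, FC-1000.
FC_SIZES = [4096, 4096, 1000]


def count_vgg16_parameters(in_channels=3, input_size=224):
    """Count weights and biases of the assumed 16-layer VGG layout."""
    total = 0
    channels = in_channels
    size = input_size
    for out_channels, num_convs in CONV_STAGES:
        for _ in range(num_convs):
            # 3x3 kernel weights plus one bias per output channel.
            total += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # the max-pool at the end of each stage halves the resolution
    features = channels * size * size  # flattened feature map fed to the first FC layer
    for out_features in FC_SIZES:
        total += features * out_features + out_features
        features = out_features
    return total


if __name__ == "__main__":
    print(f"{count_vgg16_parameters():,} parameters")
```

Under these assumptions the count comes to roughly 138 million, and most of it sits in the first fully connected layer rather than in the convolutions.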
AlexNet / Clarifai [Krizhevsky et al., 2012; Zeiler & Fergus, 2013]
[Figure: AlexNet / Clarifai architecture. Input Image → Conv → MaxPool → Normalize → Conv → MaxPool → Normalize → Conv → Conv → Conv → MaxPool → FC → FC → FC → Softmax → Softmax Loss]
ImageNet 2014 Challenge
LIMITED RESOURCES
• NVIDIA GTX580 (1.5GB Memory)
• Two-Month Effort
OPTIMIZATION
• Multi-Crop Pooling
• Model Fusion
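RESULTS
[Figure: percentage reported at each stage of the classification pipeline]
• Query image → single CNN: 15.4%
• Multiple crops → one CNN pass per crop → Pooling → Pooled Scores: 12.1%
• CNN Model 1, CNN Model 2, ..., CNN Model N → Model Fusion → Fused Scores: 11.4%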
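The two optimizations named on the poster, multi-crop pooling and model fusion, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The poster does not say how scores are pooled or fused, so plain averaging of softmax probabilities is an assumption. All function names are hypothetical, and random logits stand in for real CNN outputs.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def pool_crop_scores(crop_logits):
    """Multi-crop pooling (assumed: mean of class probabilities over crops).

    crop_logits has shape (n_crops, n_classes), one row per crop of the query image.
    """
    return softmax(np.asarray(crop_logits, dtype=float)).mean(axis=0)


def fuse_models(pooled_scores, weights=None):
    """Model fusion (assumed: weighted mean of the per-model pooled scores).

    pooled_scores has shape (n_models, n_classes).
    """
    pooled = np.asarray(pooled_scores, dtype=float)
    if weights is None:
        weights = np.ones(len(pooled))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * pooled).sum(axis=0) / weights.sum()


def top_k(scores, k=5):
    """Indices of the k highest-scoring classes, best first."""
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: 3 models, 10 crops per image, 1000 classes.
    logits = rng.standard_normal((3, 10, 1000))
    pooled = [pool_crop_scores(model_logits) for model_logits in logits]
    fused = fuse_models(pooled)
    print(top_k(fused))
```

Averaging probabilities rather than raw logits keeps every stage a valid distribution, which makes the fused scores easy to compare across models. Max-pooling over crops or learned fusion weights would be equally plausible readings of the poster.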
Learning Multimodal Representations
For each video, a compact and multimodal subject-scene subspace is learnt from high-dimensional CNN descriptors using novel unsupervised deep learning methods.
Tunable Automatic Video Summaries
The multimodal representations are used to automatically generate compact summaries from videos. Subject-scene centricity can be tuned with a single parameter.
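One way to picture tuning with a single parameter is a convex combination of per-frame subject and scene scores. The sketch below is purely illustrative. The poster does not state how the parameter enters the method, so the blending rule, the convention that α = 1 means fully subject-centric, and every name in the code are assumptions.

```python
import numpy as np


def select_summary_frames(subject_scores, scene_scores, alpha, num_frames):
    """Pick frames for a summary by blending two per-frame relevance scores.

    Assumed convention: alpha = 1 uses only the subject scores,
    alpha = 0 uses only the scene scores.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    subject = np.asarray(subject_scores, dtype=float)
    scene = np.asarray(scene_scores, dtype=float)
    combined = alpha * subject + (1.0 - alpha) * scene
    # Take the highest-scoring frames, then restore chronological order.
    best = np.argsort(combined, kind="stable")[::-1][:num_frames]
    return np.sort(best)


if __name__ == "__main__":
    # Hypothetical scores for a four-frame video.
    subject = [0.9, 0.1, 0.8, 0.2]
    scene = [0.1, 0.9, 0.2, 0.7]
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, select_summary_frames(subject, scene, alpha, num_frames=2))
```

With these toy scores, the two extreme settings select disjoint sets of frames, which is the behaviour a tunable summary is meant to expose.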
DEEPHASH
• Binary descriptors (hashes) from images
• Unsupervised and supervised deep learning pipelines
• Application to image similarity search
RESULTS
• Very compact binary descriptors in the 32-1024 bit range
• State-of-the-art retrieval results on many publicly available datasets
• Enabling similarity search from internet-scale databases
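To show why compact binary descriptors make large-scale similarity search practical, here is a minimal sketch of hashing followed by Hamming-distance ranking. It does not reproduce the DeepHash model. A random projection followed by a sign threshold stands in for the trained network, and the descriptor size (4096), hash length (64 bits) and function names are assumptions chosen for illustration.

```python
import numpy as np


def binarize(descriptors, projection):
    """Map real-valued descriptors to bits by projecting and thresholding at zero.

    The random projection is a stand-in for a trained hashing model.
    """
    return (np.asarray(descriptors) @ projection > 0).astype(np.uint8)


def hamming_search(query_bits, database_bits, k=5):
    """Return the indices and Hamming distances of the k closest database hashes."""
    distances = np.count_nonzero(database_bits != query_bits, axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]


rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 4096))   # hypothetical high-dimensional descriptors
projection = rng.standard_normal((4096, 64))   # 64-bit hash, stand-in for learned weights
database_bits = binarize(database, projection)

# Query with a slightly perturbed copy of database item 42.
query = database[42] + 0.1 * rng.standard_normal(4096)
indices, distances = hamming_search(binarize(query, projection), database_bits, k=3)
print(indices, distances)
```

Each 64-bit hash occupies 8 bytes once packed, against 32 KB for a 4096-dimensional double-precision descriptor, and Hamming distance reduces to XOR and bit counting. That combination is what makes search over very large collections feasible.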
Automatic image understanding with human-like accuracy is the new frontier of artificial intelligence research, and deep learning neural nets are front-running the race. While striving to reach and maintain state-of-the-art performance in large-scale image classification, the deep learning group at IPAL is also exploring how the deep image models can be used to push the limits in various other fields of application such as image compression, similarity-based image search and automatic video summarization. Feel free to approach us for demos!
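[Figure: learning the multimodal subject-scene representation. Labels: Subject DCNN, Scene DCNN, DCNN subject descriptor, DCNN scene descriptor, RBM (two instances), Regularize with scene, Regularize with subjects, Latent subject space, Latent scene space]
Oxford VGG: 16 layers, 138M parameters
AlexNet / Clarifai: 8 layers, 60M parameters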