


Image & Pervasive Access Lab CNRS UMI 2955 - Singapore www.ipal.cnrs.fr

Deep Learning for Image Understanding

Olivier Morère¹, Julie Petta², Jie Lin³, Vijay Chandrasekhar³, Antoine Veillard¹

¹ Université Pierre et Marie Curie, ² Supélec, ³ A-Star Institute for Infocomm Research

Team Web & Data Science

Image Classification

Video Summarization

Compact Image Representations for Image Similarity Search

Convolutional Neural Networks

[Figure: DeepHash pipeline. Input Image; Global Feature Extraction with a Deep Convolutional Neural Network or Fisher Vector (dimension labels: 4K dim., 8K-64K dim.); High-dimensional Image Descriptor; Stacked Regularized RBMs (W1, W2, ..., WL); Compact Binary Hash (64-1K bits). Training Phase 1: Unsupervised (Loss 1, Loss 2, ..., Loss L). Training Phase 2: Fine-Tuning with a Deep Siamese Network on matching & non-matching pairs. Testing: Image Descriptor Hashing with the Trained DeepHash Model (transfer model).]

[Figure: video summaries for α = 1, 0 < α < 1 and α = 0, on an axis from "More subject-centric" to "More scene-centric".]


GoogLeNet [Szegedy et al., 2014]

Oxford VGG [Simonyan & Zisserman, 2014]

[Figure: Oxford VGG layer diagram. Input Image, Conv-64 ×2, MaxPool, Conv-128 ×2, MaxPool, Conv-256 ×2, MaxPool, Conv-512 ×2, MaxPool, Conv-512 ×2, MaxPool, FC-4096 ×2, FC-1000, Softmax, Softmax Loss.]
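The "16 Layers, 138M parameters" annotation on the poster can be checked with a little arithmetic. The sketch below assumes the standard 16-layer configuration (configuration D) from Simonyan & Zisserman, with 3×3 convolutions, a 224×224 RGB input and a bias on every layer. The poster's diagram is abbreviated, so the exact layer list is an assumption taken from that paper rather than from the poster.

```python
# Parameter count for a 16-layer VGG network (assumed: configuration D of
# Simonyan & Zisserman, 3x3 convolutions, 224x224 RGB input, biases included).

# Output channels of the 13 convolutional layers; "M" marks a 2x2 max-pool.
CONV_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # three fully connected layers


def count_vgg_parameters(input_size: int = 224, in_channels: int = 3) -> int:
    """Return the number of trainable weights and biases."""
    total = 0
    channels = in_channels
    spatial = input_size
    for item in CONV_CONFIG:
        if item == "M":
            spatial //= 2  # pooling halves the feature map and adds no weights
        else:
            total += channels * item * 3 * 3 + item  # kernel weights + biases
            channels = item
    features = channels * spatial * spatial  # flattened input to the first FC layer
    for size in FC_SIZES:
        total += features * size + size
        features = size
    return total


if __name__ == "__main__":
    print(f"{count_vgg_parameters():,} parameters")
```

Under these assumptions the count comes to 138,357,544, which rounds to the 138M quoted on the poster; most of the weights sit in the first fully connected layer.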

AlexNet / Clarifai [Krizhevsky et al., 2012; Zeiler & Fergus, 2013]

[Figure: AlexNet / Clarifai layer diagram. Input Image, Conv, MaxPool, Normalize, Conv, MaxPool, Normalize, Conv, Conv, Conv, MaxPool, FC, FC, FC, Softmax, Softmax Loss.]

ImageNet 2014 Challenge

LIMITED RESOURCES
• NVIDIA GTX580 (1.5GB Memory)
• Two-Month Effort

OPTIMIZATION
• Multi-Crop Pooling
• Model Fusion

RESULTS

[Figure: results pipeline. Query image through a single CNN: 15.4%. Multiple crops through CNN MODEL 1 with pooling (pooled scores): 12.1%. Model fusion of CNN MODEL 1, CNN MODEL 2, ..., CNN MODEL N (fused scores): 11.4%.]
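The two optimizations named above, multi-crop pooling and model fusion, can be sketched in a few lines. The poster does not say which pooling or fusion rule was used, so this sketch assumes a plain element-wise mean in both steps; the function names and the toy scores are hypothetical.

```python
from typing import List, Sequence


def mean_scores(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of several equally long class-score vectors."""
    if not score_lists:
        raise ValueError("need at least one score vector")
    n_classes = len(score_lists[0])
    if any(len(scores) != n_classes for scores in score_lists):
        raise ValueError("all score vectors must have the same length")
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def pool_crops(crop_scores: Sequence[Sequence[float]]) -> List[float]:
    """Multi-crop pooling (assumed: mean over the crops of one image)."""
    return mean_scores(crop_scores)


def fuse_models(pooled_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Model fusion (assumed: mean over the pooled scores of each model)."""
    return mean_scores(pooled_per_model)


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring classes, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy class scores for two crops of one image, from two hypothetical models.
    model_1 = pool_crops([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
    model_2 = pool_crops([[0.7, 0.1, 0.2], [0.3, 0.3, 0.4]])
    fused = fuse_models([model_1, model_2])
    print(top_k(fused, 2))
```

Max pooling over crops or a weighted fusion would slot into the same structure by swapping out `mean_scores`.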

Learning Multimodal Representations

Tunable Automatic Video Summaries

For each video, a compact and multimodal subject-scene subspace is learnt from high-dimensional CNN descriptors using novel unsupervised deep learning methods.

The multimodal representations are used to automatically generate compact summaries from videos. Subject-scene centricity can be tuned with a single parameter.
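The poster does not describe how the single tuning parameter enters the summary, so the following is only an illustrative sketch. It assumes each frame already has a subject score and a scene score, blends them linearly with a weight called `alpha` here, and keeps the best-scoring frames. The linear blend, the function names and the scores are all hypothetical; the convention that `alpha = 1` is fully subject-centric and `alpha = 0` fully scene-centric is inferred from the order of the figure labels.

```python
from typing import List, Sequence


def blend_scores(subject: Sequence[float], scene: Sequence[float],
                 alpha: float) -> List[float]:
    """Per-frame score: alpha * subject + (1 - alpha) * scene (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(subject) != len(scene):
        raise ValueError("need one subject and one scene score per frame")
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(subject, scene)]


def summarize(subject: Sequence[float], scene: Sequence[float],
              alpha: float, n_frames: int) -> List[int]:
    """Pick the n_frames best frames and return them in chronological order."""
    scores = blend_scores(subject, scene, alpha)
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(best[:n_frames])


if __name__ == "__main__":
    subject_scores = [0.9, 0.1, 0.8, 0.2]  # hypothetical per-frame values
    scene_scores = [0.1, 0.9, 0.2, 0.8]
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, summarize(subject_scores, scene_scores, alpha, 2))
```

With these toy scores, `alpha = 1.0` selects frames 0 and 2 and `alpha = 0.0` selects frames 1 and 3, so moving one value changes the character of the summary.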

DEEPHASH
• Binary descriptors (hash) from images
• Unsupervised and supervised deep learning pipelines
• Application to image similarity search

RESULTS
• Very compact binary descriptors in the 32-1024 bits range
• State-of-the-art retrieval results on many publicly available datasets
• Enabling similarity search from internet-scale databases
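Compact binary hashes are attractive for similarity search because two codes can be compared with a single XOR and a bit count. The sketch below illustrates that search step only, not the DeepHash model itself. It assumes hashes are compared by Hamming distance (the poster does not name the metric) and stores each hash as a Python integer; the image names and 8-bit codes are made up for illustration.

```python
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary hashes."""
    return bin(a ^ b).count("1")


def search(query: int, database: Dict[str, int],
           top_k: int = 3) -> List[Tuple[str, int]]:
    """Return the top_k database entries closest to the query hash."""
    ranked = sorted(
        ((name, hamming(query, code)) for name, code in database.items()),
        key=lambda pair: (pair[1], pair[0]),  # by distance, then by name
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical 8-bit hashes; the poster reports 32-1024 bit codes.
    database = {
        "img_a": 0b10110010,
        "img_b": 0b10110011,
        "img_c": 0b01001101,
    }
    print(search(0b10110110, database, top_k=2))
```

A linear scan like this is enough for a demonstration; at internet scale the same distance would normally sit behind an index structure.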

Automatic image understanding with human-like accuracy is the new frontier of artificial intelligence research, and deep learning neural nets are front-running the race. While striving to reach and maintain state-of-the-art performance in large-scale image classification, the deep learning group at IPAL is also exploring how the deep image models can be used to push the limits in various other fields of application such as image compression, similarity-based image search and automatic video summarization. Feel free to approach us for demos!

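[Figure: learning multimodal representations. Labels: Subject DCNN, Scene DCNN, DCNN subject descriptor, DCNN scene descriptor, RBM, RBM, Regularize with scene, Regularize with subjects, Latent subject space, Latent scene space.]

[Figure annotations for the Convolutional Neural Networks panel: 16 Layers, 138M parameters (Oxford VGG); 8 Layers, 60M parameters (AlexNet / Clarifai).]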