
Post on 07-Jun-2020


Image & Pervasive Access Lab CNRS UMI 2955 - Singapore www.ipal.cnrs.fr

Deep Learning for Image Understanding

Olivier Morère (1), Julie Petta (2), Jie Lin (3), Vijay Chandrasekhar (3), Antoine Veillard (1)

(1) Université Pierre et Marie Curie, (2) Supélec, (3) A-Star Institute for Infocomm Research

Team Web & Data Science

Image Classification

Video Summarization

Compact Image Representations for Image Similarity Search

Convolutional Neural Networks

[Figure: DeepHash pipeline. Global Feature Extraction: an Input Image is passed through a Deep Convolutional Neural Network or a Fisher Vector encoder to give a High-dimensional Image Descriptor (labels: 4K dim., 8K-64K dim.). Training Phase 1 (Unsupervised): Stacked Regularized RBMs with weights W1, W2, ..., WL. Training Phase 2 (Fine-Tuning): Deep Siamese Network with losses Loss1, Loss2, ..., LossL, trained on matching and non-matching pairs. Hashing (Testing): the weights W1, W2, ..., WL are transferred to the Trained DeepHash Model, which maps an Image Descriptor to a Compact Binary Hash of 64-1K bits.]

[Figure: example video summaries. Labels: α = 1, 0 < α < 1, α = 0; More subject-centric, More scene-centric.]

GoogLeNet [Szegedy et al., 2014]

Oxford VGG [Simonyan & Zisserman, 2014]

Input Image → Conv-64 → Conv-64 → MaxPool → Conv-128 → Conv-128 → MaxPool → Conv-256 → Conv-256 → MaxPool → Conv-512 → Conv-512 → MaxPool → Conv-512 → Conv-512 → MaxPool → FC-4096 → FC-4096 → FC-1000 → Softmax → Softmax Loss
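The poster quotes 16 layers and 138M parameters, which matches the 16-layer configuration (configuration D) of Simonyan & Zisserman. The arithmetic can be checked with a short sketch. The layer widths below are an assumption taken from that configuration (13 convolutional layers with 3x3 kernels, 224x224 RGB input, five 2x2 poolings), not from the condensed diagram on this poster, and the function names are hypothetical.

```python
# Output channels of the 13 conv layers in VGG configuration D
# (assumed; the poster's diagram shows a shortened layer list).
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                 512, 512, 512, 512, 512, 512]
FC_SIZES = [4096, 4096, 1000]


def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def vgg16_param_count(in_channels=3, final_map=7):
    """Total parameters; final_map = 224 / 2**5 after five max-poolings."""
    total = 0
    c_in = in_channels
    for c_out in CONV_CHANNELS:
        total += conv_params(c_in, c_out)
        c_in = c_out
    n_in = final_map * final_map * c_in  # flattened 7 x 7 x 512 feature map
    for n_out in FC_SIZES:
        total += fc_params(n_in, n_out)
        n_in = n_out
    return total


print(vgg16_param_count())  # 138357544, i.e. about 138M
```

Most of the total (roughly 124M) sits in the three fully connected layers, which is why the first FC-4096 layer dominates memory use on a small GPU.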

AlexNet / Clarifai [Krizhevsky et al., 2012; Zeiler & Fergus, 2013]

Input Image → Conv → MaxPool → Normalize → Conv → MaxPool → Normalize → Conv → Conv → Conv → MaxPool → FC → FC → FC → Softmax → Softmax Loss

ImageNet 2014 Challenge

LIMITED RESOURCES
• NVIDIA GTX580 (1.5GB Memory)
• Two-Month Effort

OPTIMIZATION
• Multi-Crop Pooling
• Model Fusion

RESULTS
• Single CNN applied to the query image: 15.4%
• Multi-crop pooling (one CNN model applied to multiple crops, scores pooled): 12.1%
• Model fusion (pooled scores of CNN models 1 to N fused): 11.4%
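The two optimizations can be sketched as follows. This is a minimal illustration, assuming that pooling means averaging class probabilities over crops and that fusion means a (weighted) average of the pooled scores of several models; the poster does not state which pooling or fusion rule was used, and the function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def pool_crops(crop_logits):
    """Average class probabilities over crops.

    crop_logits: array of shape (n_crops, n_classes) from one model.
    """
    return softmax(crop_logits).mean(axis=0)


def fuse_models(pooled_scores, weights=None):
    """Weighted average of per-model pooled scores, shape (n_models, n_classes)."""
    scores = np.asarray(pooled_scores, dtype=float)
    if weights is None:
        weights = np.ones(len(scores))  # assumption: equal weights
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * scores).sum(axis=0) / weights.sum()


# Toy data standing in for network outputs: 3 models, 10 crops, 5 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10, 5))
pooled = [pool_crops(model_logits) for model_logits in logits]
fused = fuse_models(pooled)
predicted_class = int(np.argmax(fused))
```

Averaging probabilities rather than raw logits keeps each crop's contribution bounded, so a single confident but wrong crop cannot dominate the pooled score.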

Learning Multimodal Representations

Tunable Automatic Video Summaries

For each video, a compact and multimodal subject-scene subspace is learnt from high-dimensional CNN descriptors using novel unsupervised deep learning methods.

The multimodal representations are used to automatically generate compact summaries from videos. Subject-scene centricity can be tuned with a single parameter.
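One way a single parameter could trade off subject and scene content is sketched below. This is not the authors' method: it assumes that α blends pairwise frame distances computed in the subject and scene spaces (α = 1 subject only, α = 0 scene only) and that key frames are picked by greedy farthest-point selection. All names are hypothetical.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def blended_distance(subject_feats, scene_feats, alpha):
    """Convex combination of subject and scene distances (assumed convention)."""
    return (alpha * pairwise_distances(subject_feats)
            + (1.0 - alpha) * pairwise_distances(scene_feats))


def select_keyframes(dist, n_keyframes):
    """Greedy farthest-point selection, starting from frame 0.

    Assumes at least n_keyframes frames are mutually distinct under dist.
    """
    chosen = [0]
    min_dist = dist[0].copy()  # distance of each frame to the chosen set
    while len(chosen) < n_keyframes:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return sorted(chosen)


# Toy per-frame descriptors standing in for the learnt representations.
rng = np.random.default_rng(0)
subject = rng.normal(size=(20, 8))
scene = rng.normal(size=(20, 8))
keys = select_keyframes(blended_distance(subject, scene, alpha=0.5), 3)
```

Sweeping `alpha` from 0 to 1 changes which frames count as far apart, and therefore which frames end up in the summary.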

DEEPHASH
• Binary descriptors (hashes) from images
• Unsupervised and supervised deep learning pipelines
• Application to image similarity search

RESULTS
• Very compact binary descriptors in the 32-1024 bits range
• State-of-the-art retrieval results on many publicly available datasets
• Enabling similarity search from internet-scale databases
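Compact binary descriptors make similarity search cheap because two hashes can be compared with an XOR and a bit count. The sketch below illustrates only the search side, assuming hashes are obtained by thresholding real-valued codes at zero; it does not reproduce the DeepHash training pipeline, and the function names are hypothetical.

```python
import numpy as np

# Number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(codes):
    """Threshold codes at 0 and pack 8 bits per byte.

    codes: (n_items, n_bits) real-valued array, n_bits divisible by 8.
    """
    bits = (np.asarray(codes) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_search(query_hash, db_hashes, k=5):
    """Return indices and Hamming distances of the k nearest database hashes."""
    xor = np.bitwise_xor(db_hashes, query_hash)  # differing bits, per byte
    dists = POPCOUNT[xor].sum(axis=1)            # count them per item
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


# Toy database of 1000 items with 64-bit hashes (8 bytes each).
rng = np.random.default_rng(0)
db = binarize(rng.normal(size=(1000, 64)))
idx, dist = hamming_search(db[42], db, k=3)
```

At 64 bits per image, one billion images occupy about 8 GB, which is what makes the internet-scale search mentioned above plausible on commodity hardware.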

Automatic image understanding with human-like accuracy is the new frontier of artificial intelligence research, and deep neural networks are leading the race. While striving to reach and maintain state-of-the-art performance in large-scale image classification, the deep learning group at IPAL is also exploring how deep image models can push the limits in other fields of application, such as image compression, similarity-based image search and automatic video summarization. Feel free to approach us for demos!

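[Figure: learning multimodal representations. A Subject DCNN and a Scene DCNN produce a DCNN subject descriptor and a DCNN scene descriptor. One RBM per modality maps the descriptors to a latent subject space and a latent scene space. Labels on the RBMs: Regularize with scene, Regularize with subjects.]

Oxford VGG: 16 layers, 138M parameters

AlexNet / Clarifai: 8 layers, 60M parameters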
