Image & Pervasive Access Lab CNRS UMI 2955 - Singapore www.ipal.cnrs.fr

Deep Learning for Image Understanding

Olivier Morère¹, Julie Petta², Jie Lin³, Vijay Chandrasekhar³, Antoine Veillard¹

¹Université Pierre et Marie Curie, ²Supélec, ³A-Star Institute for Infocomm Research

Team Web & Data Science

Image Classification

Video Summarization

Convolutional Neural Networks

Compact Image Representations for Image Similarity Search

[Figure: DeepHash pipeline.
Global Feature Extraction: Input Image → High-dimensional Image Descriptor (Deep Convolutional Neural Network, 4K dim., or Fisher Vector, 8K-64K dim.).
Training Phase 1 (Unsupervised): Stacked Regularized RBMs with weights W1, W2, ..., WL.
Training Phase 2 (Fine-Tuning): Deep Siamese Network with weights W1, W2, ..., WL and losses Loss1, Loss2, ..., LossL, trained on matching & non-matching pairs.
Hashing (Testing): the model is transferred from training to testing; the Trained DeepHash Model with weights W1, W2, ..., WL maps the Image Descriptor to a Compact Binary Hash of 64-1K bits.]

[Figure: tunable video summaries for α = 1, 0 < α < 1 and α = 0, spanning more subject-centric and more scene-centric summaries.]

GoogLeNet [Szegedy et al., 2014]

Oxford VGG [Simonyan & Zisserman, 2014]

Input Image → Conv-64 ×2 → MaxPool → Conv-128 ×2 → MaxPool → Conv-256 ×2 → MaxPool → Conv-512 ×2 → MaxPool → Conv-512 ×2 → MaxPool → FC-4096 ×2 → FC-1000 → Softmax → Softmax Loss

AlexNet / Clarifai [Krizhevsky et al., 2012; Zeiler & Fergus, 2013]

Input Image → Conv → MaxPool → Normalize → Conv → MaxPool → Normalize → Conv → Conv → Conv → MaxPool → FC → FC → FC → Softmax → Softmax Loss

ImageNet 2014 Challenge

LIMITED RESOURCES
• NVIDIA GTX580 (1.5GB Memory)
• Two-Month Effort

OPTIMIZATION
• Multi-Crop Pooling
• Model Fusion

RESULTS
• Single CNN on the query image: 15.4%
• Multi-crop pooling (one CNN model applied to multiple crops, scores pooled): 12.1%
• Model fusion (pooled scores of CNN models 1 to N fused): 11.4%
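The two optimizations can be illustrated with a minimal NumPy sketch. It assumes that pooling means averaging per-crop softmax probabilities and that fusion means a weighted average of the pooled scores of each model; the poster does not specify either operator, and the function names (`pool_crops`, `fuse_models`) and random logits are hypothetical stand-ins for real CNN outputs.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pool_crops(crop_logits):
    """Average class probabilities over crops: (n_crops, n_classes) -> (n_classes,).
    Mean pooling is an assumption; max pooling would be another option."""
    return softmax(crop_logits).mean(axis=0)

def fuse_models(pooled_scores, weights=None):
    """Weighted average of per-model scores: (n_models, n_classes) -> (n_classes,).
    Uniform weights are used when none are given (an assumption)."""
    pooled = np.asarray(pooled_scores, dtype=float)
    if weights is None:
        weights = np.ones(len(pooled))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * pooled).sum(axis=0) / weights.sum()

# Random logits stand in for the outputs of N CNN models on several crops.
rng = np.random.default_rng(0)
n_models, n_crops, n_classes = 3, 10, 1000
logits = rng.normal(size=(n_models, n_crops, n_classes))

pooled = np.stack([pool_crops(m) for m in logits])  # one score vector per model
fused = fuse_models(pooled)                         # final fused scores
top5 = np.argsort(fused)[::-1][:5]                  # five highest-scoring classes
```

Because each pooled vector is an average of probability vectors, the fused scores still sum to one and can be ranked directly.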

Learning Multimodal Representations

For each video, a compact and multimodal subject-scene subspace is learnt from high-dimensional CNN descriptors using novel unsupervised deep learning methods.

Tunable Automatic Video Summaries

The multimodal representations are used to automatically generate compact summaries from videos. Subject-scene centricity can be tuned with a single parameter.
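One way to picture a single tuning parameter is a convex combination of two per-frame relevance scores. The sketch below is only an illustration under strong assumptions: the per-frame `subject` and `scene` scores, the blending rule, the convention that α = 1 means fully subject-centric, and the function names are all hypothetical and not taken from the poster.

```python
import numpy as np

def summary_scores(subject_scores, scene_scores, alpha):
    """Blend per-frame scores; alpha = 1 uses subject scores only,
    alpha = 0 uses scene scores only (assumed convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    subject_scores = np.asarray(subject_scores, dtype=float)
    scene_scores = np.asarray(scene_scores, dtype=float)
    return alpha * subject_scores + (1.0 - alpha) * scene_scores

def select_keyframes(scores, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)

# Toy scores for a five-frame video (hypothetical values).
subject = np.array([0.9, 0.1, 0.8, 0.2, 0.3])
scene = np.array([0.1, 0.7, 0.2, 0.9, 0.4])

subject_summary = select_keyframes(summary_scores(subject, scene, 1.0), k=2)
scene_summary = select_keyframes(summary_scores(subject, scene, 0.0), k=2)
mixed_summary = select_keyframes(summary_scores(subject, scene, 0.5), k=2)
```

Intermediate values of α trade off the two criteria, so the selected frames shift gradually between the two extremes.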

DEEPHASH
• Binary descriptors (hash) from images
• Unsupervised and supervised deep learning pipelines
• Application to image similarity search

RESULTS
• Very compact binary descriptors in the 32-1024 bits range
• State-of-the-art retrieval results on many publicly available datasets
• Enabling similarity search from internet-scale databases
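To show why compact binary descriptors suit large-scale similarity search, here is a minimal sketch of hashing and Hamming-distance ranking. It does not implement DeepHash: a random projection followed by thresholding stands in for the trained model, and the descriptor size (4096), code length (64 bits), database size, and function names are assumptions chosen for illustration.

```python
import numpy as np

def binarize(descriptors, projection):
    """Project descriptors (n, d) to n_bits dimensions, threshold at zero,
    and pack the bits into bytes: result has shape (n, n_bits // 8)."""
    bits = (descriptors @ projection > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distances(query_code, database_codes):
    """Hamming distance between one packed code (1, n_bytes) and many (n, n_bytes)."""
    xor = np.bitwise_xor(database_codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

rng = np.random.default_rng(0)
d, n_bits = 4096, 64
projection = rng.normal(size=(d, n_bits))  # stand-in for a trained hashing model
database = rng.normal(size=(1000, d))      # stand-in for image descriptors
query = database[42] + 0.01 * rng.normal(size=d)  # slightly perturbed copy of item 42

db_codes = binarize(database, projection)       # 8 bytes per image
q_code = binarize(query[None, :], projection)
dists = hamming_distances(q_code, db_codes)
ranking = np.argsort(dists)                     # nearest database items first
```

Each image is stored in 8 bytes here, and comparing two codes needs only an XOR and a bit count, which is what makes search over very large databases practical.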

Automatic image understanding with human-like accuracy is the new frontier of artificial intelligence research, and deep learning neural nets are front-running the race. While striving to reach and maintain state-of-the-art performance in large-scale image classification, the deep learning group at IPAL is also exploring how deep image models can be used to push the limits in various other fields of application such as image compression, similarity-based image search and automatic video summarization. Feel free to approach us for demos!

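[Figure: learning multimodal representations. A Subject DCNN and a Scene DCNN produce a DCNN subject descriptor and a DCNN scene descriptor. Each descriptor feeds an RBM, one regularized with scene and the other regularized with subjects, yielding a latent subject space and a latent scene space.]

Network sizes: 16 layers, 138M parameters (Oxford VGG); 8 layers, 60M parameters (AlexNet / Clarifai).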
