cnn in image processing - didras.irdidras.ir/sites/default/files/4.2.image processing using...

CNN in Image Processing: Classification, Detection and Retrieval

Ali Ahmadi

18th October, 2017

K.N.Toosi University of Technology

1

Definitions of Deep Learning

Classes of Deep Learning Networks

Various architectures in Deep Convolutional Neural Networks

RCNN approach to object detection

DNN-based regression approach to object detection

Our proposed CBIR using a deep CNN (GoogLeNet)

2

Deep Structured Learning,

Or, Hierarchical Learning

Or, Deep Machine Learning

or more commonly called Deep Learning

Since 2006, it has emerged as a new area of

machine learning research.

3

Deep Learning is a branch of Machine Learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.

A sub-field within machine learning that is based on algorithms for

learning multiple levels of representation in order to model

complex relationships among data.

◦ Higher-level features and concepts are defined in terms of lower-level

ones, and such a hierarchy of features is called a deep architecture.

◦ Most of these models are based on unsupervised learning of

representations.
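
The definition above (multiple processing layers, each a linear transformation followed by a non-linear one) can be illustrated with a minimal sketch. The weights, the layer sizes, and the choice of ReLU as the non-linearity are assumptions made for illustration only; they are not taken from the slides.

```python
def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise non-linearity: negative values are clipped to zero."""
    return [max(0.0, v) for v in x]


def forward(x, layers):
    """Apply each (weights, bias) pair, followed by the non-linearity."""
    for weights, bias in layers:
        x = relu(linear(x, weights, bias))
    return x


# Hypothetical two-layer stack: the second layer builds its single
# higher-level feature from the two lower-level features of the first layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 features
    ([[1.0, 1.0]], [-1.0]),                   # layer 2: 2 features -> 1 feature
]
output = forward([2.0, 1.0], layers)
print(output)
```

Each layer only sees the output of the layer before it, which is the sense in which higher-level features are defined in terms of lower-level ones.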

4

It typically uses artificial neural networks.

Higher-level concepts are defined from lower-level

ones.

The same lower-level concepts can help to define

higher-level concepts.

5

Drastically increased chip processing capability

Significantly increased size of data used for

training

Recent advances in machine learning and

signal/information processing research

6

Image: object recognition, object detection, image de-noising

Audio: speech recognition, music retrieval

Text: parsing, sentiment analysis, machine translation

7

Deep Neural Networks have recently shown great performance on image classification.

We take another step and present some proposed methods for object detection and semantic segmentation which can be used for content-based image retrieval (CBIR).

We propose an approach based on a well-known deep CNN architecture, GoogLeNet.

8

11

Classes of Deep Learning Networks:

◦ Deep networks for supervised learning

◦ Deep networks for unsupervised or generative learning

◦ Hybrid deep networks

Also called Discriminative Deep Networks.

Target label data are always available in direct or

indirect forms for such supervised learning.

Examples:

◦ Deep Neural Network (DNN)

◦ Convolutional Neural Network (CNN)

12

Also called Generative Deep Networks.

Used when no information about target class labels is available.

Capture high-order correlations of the observed or visible data for pattern analysis.

Examples:

◦ Restricted Boltzmann Machines (RBM)

◦ Deep Boltzmann Machines (DBM)

◦ Deep Belief Networks (DBN)

13

Deep architecture that either comprises or makes use of both

generative and discriminative model components.

◦ This can be accomplished by better optimization and/or regularization of supervised deep networks.

◦ The generative component is mostly exploited to help with

discrimination, which is the final goal of the hybrid architecture.

14

Deep Boltzmann Machine (DBM)

Deep Belief Networks

Deep Neural Networks

AutoEncoders

Convolutional Deep Neural Networks

15

This architecture allows CNNs to take advantage of the 2D

structure of input data.

In comparison with other deep architectures, convolutional

neural networks have shown superior results in both image

and speech applications.

They can also be trained with standard back-propagation.

CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate, making them a highly attractive architecture to use.

Recent work on supervised learning for computer vision shows that the deep CNN architecture is successful not only for object/image classification but also for object detection in whole images.
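
To make the "many fewer parameters" point concrete, the sketch below counts the learnable parameters of one convolutional layer and of a fully connected layer that produces an output of the same size. The input size (32x32 RGB), the 16 output channels and the 5x5 kernel are hypothetical values chosen for illustration.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights are shared across positions: one kernel plus one bias per output channel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def dense_params(in_units, out_units):
    """Every output unit has its own weight for every input unit, plus one bias."""
    return out_units * (in_units + 1)


# Hypothetical layer: 32x32 RGB input mapped to 16 feature maps of size 32x32.
conv = conv_params(in_channels=3, out_channels=16, kernel_size=5)
dense = dense_params(in_units=32 * 32 * 3, out_units=32 * 32 * 16)

print("convolutional layer:  ", conv)
print("fully connected layer:", dense)
```

The convolutional count does not depend on the image size at all, because the same kernels are reused at every spatial position.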

16

18

Image Search

Learning effective feature representations and similarity measures is critical to the performance of a CBIR system.

Although various techniques have been proposed, this remains one of the most challenging problems in CBIR, mainly because of the "semantic gap" between the low-level image pixels captured by machines and the high-level semantic concepts perceived by humans.

20

One of the most important advances in machine learning is

known as "deep learning" that attempts to model high-level

abstractions in data by employing deep architectures

composed of multiple non-linear transformations.

We can improve CBIR using the state-of-the-art deep learning

techniques for learning feature representations and similarity

measures.

21

A deep convolutional neural network model pre-trained on a large-scale dataset can be used directly for feature extraction in new CBIR tasks and is able to capture high-level semantic information from the raw pixels.

The features extracted by a pre-trained CNN model may or may not be better than traditional hand-crafted features, but with proper feature refining schemes, the deep learning feature representations consistently outperform conventional hand-crafted features on all datasets.

22

When a pre-trained deep model is applied for feature representation in a new domain, similarity learning can further boost retrieval performance over the direct feature output of the model.

Retraining the deep models with a classification or similarity learning objective on the new domain can boost retrieval performance considerably, much more than the improvements made by shallow similarity learning.

23

A deep learning framework for CBIR includes two stages:

◦ Training a deep learning model from a large collection of training data.

◦ Learning feature representations for CBIR tasks in a new domain by using the trained deep model.

24

Features extracted from the last fully connected layers in a deep CNN-based model can be used as the feature representations for tasks such as classification, detection, and CBIR.

In CBIR, we do not consider features from the lower convolutional layers in the network, since the lower layers lack rich semantic representations.

The features extracted from the last convolutional layer and the fully connected layers are significant features, and we can make use of them for tasks such as object localization, object detection, and especially image retrieval.

25

AlexNet (2012)

ZF Net (2013)

26

SPP (2014)

VGG (2014)

27

GoogLeNet (2014)

28

AlexNet was proposed in the 2012 paper titled “ImageNet Classification with Deep Convolutional Neural Networks” [13]. At the time of this talk, the paper had been cited 6184 times.

AlexNet is a deep convolutional neural network that classifies the 1.2 million high-resolution images of the ImageNet ILSVRC-2010 contest into 1000 different classes.

ILSVRC: ImageNet Large Scale Visual Recognition Competition

On the ILSVRC-2010 test data, AlexNet achieved top-1 and top-5 error rates of 37.5% and 17.0%, which was considerably better than the previous state-of-the-art. A variant entered in ILSVRC-2012 achieved a winning top-5 test error rate of 15.3%.
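
As a reminder of how these metrics are defined, the sketch below computes top-k error: a prediction counts as correct if the true label is among the k classes with the highest scores. The scores and labels are made-up toy values, not ImageNet results.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Toy example with 3 classes and 4 samples (hypothetical values).
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [1, 2, 2, 2]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(top1, top2)
```

Top-5 error on ILSVRC is the same computation with k=5 over 1000 classes.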

AlexNet has 60 million parameters and 650,000 neurons.

AlexNet layers:

◦ Five convolutional layers, max-pooling layers, dropout layers, three fully connected layers

29

Trained the network on ImageNet data, which contained over 15 million annotated images from a total of over 22,000 categories.

Used ReLU for the nonlinearity functions (found to decrease training time, as ReLUs are several times faster than the conventional tanh function).

Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.

Implemented dropout layers in order to combat the problem of overfitting to the training data.

Trained the model using mini-batch stochastic gradient descent, with specific values for momentum and weight decay.

Trained on two GTX 580 GPUs for five to six days.
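
The update rule behind "stochastic gradient descent with momentum and weight decay" can be sketched as follows. The default momentum of 0.9 and weight decay of 0.0005 are the values reported in the AlexNet paper [13]; the learning rates and the toy quadratic objective are assumptions chosen only to keep the example small.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    new_v = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# A single step on one weight, to show the arithmetic.
one_w, one_v = sgd_momentum_step([1.0], [0.0], grad=[0.5], lr=0.1)

# Toy objective f(w) = 0.5 * sum(w_i ** 2), whose gradient is w itself.
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(300):
    w, v = sgd_momentum_step(w, v, grad=list(w), lr=0.1)

print(one_w, one_v)
print(w)
```

In real training the gradient is averaged over a mini-batch of images rather than computed from a closed-form objective.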

32

Zeiler-Fergus Net (ZF Net) is the winner of the ILSVRC competition in 2013.

ZF Net achieved 11.7% top-5 error rate.

This architecture was more of a fine-tuning of the previous AlexNet structure, but it still developed some key ideas about improving performance.

34

As the network grows, we also see a rise in the number of

filters used.

Used ReLUs for their activation functions, cross-entropy loss

for the error function, and trained using batch stochastic

gradient descent.

Trained on a GTX 580 GPU for twelve days.

36

VGG Net was proposed by the Visual Geometry Group, Department of Engineering Science, University of Oxford, for ILSVRC 2014.

Key ideas: simplicity and depth.

It was not the winner of the ILSVRC 2014 classification task.

VGG Net achieved a 7.3% top-5 error rate.

37

Worked well on both image classification

and localization tasks.

Built model with Caffe toolbox.

Used ReLU layers after each conv layer

and trained with batch gradient descent.

Trained on 4 Nvidia Titan Black GPUs

for two to three weeks.

40

Two well-known object detection approaches based on deep

convolutional neural networks:

◦ RCNN (Regions with CNN features)

◦ DNN-based regression

43

RCNN includes five stages:

◦ Stage 1: Determining object proposals without considering the category of the image.

◦ Stage 2: Extracting a fixed-length feature vector from each warped proposal using a CNN.

◦ Stage 3: Training a set of class-specific linear SVM classifiers.

◦ Stage 4: Ranking the proposals and using Non-Maximum Suppression to get the bounding boxes.

◦ Stage 5: Using bounding box regression to improve localization performance.
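
Stage 4 can be illustrated with a minimal sketch of greedy Non-Maximum Suppression: proposals are visited in order of decreasing score, and a proposal is kept only if it does not overlap too much with a box that was already kept. The box format (x1, y1, x2, y2), the example boxes and the IoU threshold of 0.5 are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps no already-kept box beyond the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping proposals and one separate proposal (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
keep = non_max_suppression(boxes, scores)
print(keep)
```

In a detector this is normally run once per object class, on the boxes scored for that class.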

44

DeepID-Net, an RCNN-based method, improves the results of the RCNN framework by using deformation models of object parts and multi-stage training.

The stages of DeepID-Net:

46

DeepID-Net consists of four parts:

◦ Part 1: the baseline deep model.

◦ Part 2: the layers with multi-stage training.

◦ Part 3: the layers with variable filter sizes and def-pooling layer.

◦ Part 4: the deep model for obtaining 1000-class image classification scores.

47

The DNN-based regression approach formulates object detection as a regression problem whose output is object bounding box masks.

Methods that use the DNN-based regression approach for object detection define a multi-scale inference procedure which is able to produce high-resolution object detections.

This regression uses the architecture of a deep CNN and changes the last fully connected layer, or both the last fully connected layer and the last convolutional layer.

48

OverFeat is another algorithm that uses DNN-based regression for classification, localization and detection.

This integrated framework was the winner of the localization task of ILSVRC 2013 and obtained very competitive results for the detection and classification tasks.

In this algorithm, a multi-scale, sliding-window approach is efficiently implemented by DNN-based regression.

OverFeat accumulates bounding boxes in order to increase detection confidence.

50

OverFeat explores the entire image by densely running the network at each location and at multiple scales.

This approach yields significantly more views for voting, which increases robustness and efficiency.

The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.

OverFeat uses 6 scales of input, which result in unpooled layer 5 maps of varying resolution.
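
The idea of dense, multi-scale exploration can be sketched by enumerating the windows ("views") that a sliding-window scan visits at several image scales. This is only an illustration of the concept: the image size, window size, stride and scales below are hypothetical, and a convolutional implementation shares computation between overlapping windows instead of cropping each one.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x1, y1, x2, y2) for every square window that fits fully inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, x + window, y + window)


def multi_scale_windows(image_w, image_h, scales, window, stride):
    """Map each scale to the list of windows over the image resized by that scale."""
    result = {}
    for s in scales:
        w, h = int(image_w * s), int(image_h * s)
        result[s] = list(sliding_windows(w, h, window, stride))
    return result


# Hypothetical 64x48 image scanned at three scales with a 32x32 window.
views = multi_scale_windows(64, 48, scales=[1.0, 1.5, 2.0], window=32, stride=16)
for scale, windows in views.items():
    print(scale, len(windows))
```

Larger scales produce more views of the same image, which is what gives the voting step more evidence to combine.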

51

We can change OverFeat from a detection task to a CBIR task by changing the definition of the training data labels, changing the last layers of OverFeat, and training the modified architecture for CBIR.

In this case, similar to OverFeat for detection, we can get a spatial map of C-dimensional vectors at each scale and then combine them to perform the CBIR task.

52

Differences

◦ In the RCNN-based approach, we classify images using shallow methods such as linear SVMs in order to enhance the classification and reduce object localization error. In contrast, there is no shallow classifier in the DNN-based regression approach.

◦ In the RCNN-based approach, the input of the deep CNN is a set of object proposals, while in the DNN-based regression approach the input of the deep CNN is the entire image, to which dense sliding windows are applied. Using object proposal algorithms in the RCNN-based approach increases the speed of inference, and using dense sliding windows in the DNN-based regression approach increases the precision of this approach.

53

Similarities

◦ Both approaches use the features of the pool5 layer and the fully connected layers for detection and semantic segmentation. They may feed other networks or classifiers using these features.

◦ Both approaches may modify the last layer and adapt it to detection and semantic segmentation tasks. In fact, the classifier layers are replaced by a regression network and trained to predict object bounding boxes.

◦ Both approaches make use of multi-scale images in order to increase detection and semantic segmentation precision. The better aligned the network window and the object, the stronger the confidence of the network response.

54

Our approach for CBIR is based on GoogLeNet [17]

architecture.

In our proposed CBIR, we compare images based on

the features extracted from pre-trained GoogLeNet.

We chose GoogLeNet to increase the performance of our proposed CBIR, because it lets us extract deeper feature maps.

We consider the output of the last convolutional layer

as image features to find similar images based on

these feature maps.
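
Once every image is represented by a feature vector, finding similar images reduces to ranking the database by a similarity measure. The slides do not state which measure is used, so the sketch below assumes cosine similarity; the image names and the three-dimensional vectors are made up and stand in for real GoogLeNet features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database, top_k=3):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, feat))
              for name, feat in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical feature database.
database = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
results = rank_by_similarity([2.0, 0.0, 0.0], database, top_k=2)
print(results)
```

Cosine similarity ignores vector length, so the query above matches img_a exactly even though its magnitude differs.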

55

We make use of Caffe to implement the system and extract the feature maps of the last convolutional layer of the pre-trained GoogLeNet.

In our proposed CBIR, GoogLeNet receives input images via a RabbitMQ queue cluster. Then the feature representations extracted from GoogLeNet are placed in a target queue. We implement our CBIR on a GeForce GTX 1080 GPU.

56

Stage 1: Read the encoded input images from the RabbitMQ queue.

Stage 2: Decode the images to a readable structure for Caffe.

Stage 3: Feed forward the images to the pre-trained GoogLeNet in Caffe.

Stage 4: Get the output of the last pooling layer as the feature vector.

Stage 5: Encode the feature vectors to a proper format for the queue.

Stage 6: Put the final encoded vectors into the target queue.
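
The six stages above can be sketched as a small worker function. This is not the actual implementation: Python's in-process queue.Queue stands in for the RabbitMQ queues, the JSON and base64 message format is an assumption, and extract_features is a hypothetical placeholder for the GoogLeNet forward pass in Caffe.

```python
import base64
import json
import queue


def extract_features(image_bytes):
    """Hypothetical stand-in for stages 3-4 (the GoogLeNet forward pass).

    Returns a normalised 4-bin histogram of byte values, not CNN features.
    """
    hist = [0, 0, 0, 0]
    for b in image_bytes:
        hist[b // 64] += 1
    total = max(1, len(image_bytes))
    return [h / total for h in hist]


def process_one(source, target):
    """Move one message from the source queue to the target queue."""
    message = json.loads(source.get())                 # stage 1: read encoded input
    image_bytes = base64.b64decode(message["image"])   # stage 2: decode the image
    features = extract_features(image_bytes)           # stages 3-4: feature vector
    encoded = json.dumps({"id": message["id"],
                          "features": features})       # stage 5: encode for the queue
    target.put(encoded)                                # stage 6: publish the result


source, target = queue.Queue(), queue.Queue()
payload = base64.b64encode(bytes([0, 10, 100, 200])).decode("ascii")
source.put(json.dumps({"id": "img-1", "image": payload}))

process_one(source, target)
result = json.loads(target.get())
print(result)
```

Keeping the extractor behind queues means image producers and feature consumers never call the network directly, so either side can be scaled or replaced independently.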

57

We reviewed some CNN architectures in image classification.

We presented some object detection and semantic segmentation algorithms based on deep convolutional neural networks, which can be used for CBIR.

We can change a detection task to a CBIR task by changing the definition of the training data labels, changing the last layers of the deep CNN, and training the modified architecture for CBIR.

There are two well-known semantic segmentation and object detection approaches based on deep convolutional neural networks: the RCNN-based approach and the DNN-based regression approach. These approaches have some similarities and differences.

In our proposed CBIR, we compare images based on the features extracted from pre-trained GoogLeNet. GoogLeNet receives input images via a RabbitMQ queue cluster. Then the feature representations extracted from GoogLeNet are placed in a target queue.

58

[1] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, "A survey of content-based image retrieval with high-level semantics", Pattern Recognition, vol. 40, no. 1, pp. 262-283, 2007.

[2] Y. Cao, C. Wang, L. Zhang, and L. Zhang, "Edge index for large scale sketch-based image search", in:

IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 761-768.

[3] J. Xie, Y. Fang, F. Zhu, and E. Wong, "Deepshape: Deep learned shape descriptor for 3d shape matching

and retrieval", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1275-1283.

[4] F. Wang, L. Kang, and Y. Li, "Sketch-based 3d shape retrieval using convolutional neural networks", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1875-1883.

[5] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki, "Gift: A real-time and scalable 3d shape search

engine", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5023-5032.

[6] M. Park, J. S. Jin, and L. S. Wilson, "Fast content-based image retrieval using quasi-gabor filter and reduction of image feature dimension", in: IEEE Southwest Symposium on Image Analysis and Interpretation, IEEE, 2002, pp. 178-182.

[7] X.-Y. Wang, B.-B. Zhang, and H.-Y. Yang, "Content-based image retrieval by integrating color and texture features", Multimedia Tools and Applications (MTA), vol. 68, no. 3, pp. 545-569, 2014.

[8] J. Wang and X.-S. Hua, "Interactive image search by color map", ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 1, p. 12, 2011.

[9] C. Wengert, M. Douze, and H. Jegou, "Bag-of-colors for improved image search", in: ACM International Conference on Multimedia, ACM, 2011, pp. 1437-1440.

[10] B. Wang, Z. Li, M. Li, and W.-Y. Ma, "Large-scale duplicate detection for web image search", in:

IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2006, pp. 353-356.

59

[11] J. Wan, D. Wang, S. Hoi, et al., "Deep Learning for content-based image retrieval: a comprehensive study", in: Proceeding of the Multimedia, 2014.

[12] J. Wan, D. Wang, S.C.H. Hoi, P. Wu, J. Zhu, "Deep learning for content-based image retrieval: A comprehensive study", in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014.

[13] A. Krizhevsky, I. Sutskever, G.E. Hinton, "Imagenet classification with deep convolutional neural

networks", in: Proceedings of the NIPS, 2012.

[14] M.D. Zeiler, R. Fergus, "Visualizing and understanding convolutional neural networks", in:

Proceedings of the ECCV, 2014.

[15] K. He, X. Zhang, S. Ren, et al., "Spatial pyramid pooling in deep convolutional networks for visual

recognition", in: Proceedings of the ECCV, 2014.

[16] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition",

in: Proceedings of the ICLR, 2015.

[17] C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions", in: Proceedings of the CVPR,

2015.

[18] O. Russakovsky, J. Deng, H. Su, et al., "ImageNet large scale visual recognition challenge", Int. J. Comput. Vis. 115 (3) (2015) 211-252.

[19] R. Girshick, J. Donahue, T. Darrell, et al., "Rich feature hierarchies for accurate object detection and

semantic segmentation", in: Proceedings of the CVPR, 2014.

[20] W. Ouyang, P. Luo, X. Zeng, et al., "DeepID-Net: multi-stage and deformable deep convolutional

neural networks for object detection", in: Proceedings of the CVPR, 2015.

60

[21] R. Girshick, "Fast R-CNN", in: Proceedings of the ICCV, 2015.

[22] S. Ren, K. He, R. Girshick, et al., "Faster R-CNN: towards real-time object detection with region proposal

networks", in: Proceedings of the NIPS, 2015.

[23] Y. Zhu, R. Salakhutdinov, et al., "segDeepM: exploiting segmentation and context in deep neural networks for

object detection", in: Proceedings of the CVPR, 2015.

[24] S. Gidaris, N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware CNN model", in: Proceedings of the ICCV, 2015.

[25] C. Szegedy, A. Toshev, D. Erhan, "Deep neural networks for object detection", in: Proceedings of the NIPS,

2013.

[26] P. Sermanet, D. Eigen, X. Zhang, et al., "OverFeat: integrated recognition, localization and detection using convolutional networks", in: Proceedings of the ICLR, 2014.

[27] D. Erhan, C. Szegedy, A. Toshev, et al., "Scalable object detection using deep neural networks", in:

Proceedings of the CVPR, 2014.

[28] B. Alexe, T. Deselaers, V. Ferrari, "Measuring the objectness of image windows", Pattern Anal. Mach. Intell.

IEEE Trans. 34 (11) (2012) 2189-2202.

[29] J.R.R Uijlings, K.E.A van de Sande, T. Gevers, et al., "Selective search for object recognition", Int. J. Comput.

Vis. 104 (2) (2013) 154-171.

[30] I. Endres, D. Hoiem, "Category independent object proposals", in: Proceedings of the ECCV, 2010.

[31] M.M. Cheng, Z. Zhang, W.Y. Lin, et al., "BING: binarized normed gradients for objectness estimation at

300fps", in: Proceedings of the CVPR, 2014.

[32] C.L. Zitnick, P. Dollar, "Edge boxes: locating object proposals from edges", in: Proceedings of the ECCV,

2014.

[33] J. Hosang, R. Benenson, B. Schiele, "How good are detection proposals, really?", in: Proceedings of the

BMVC, 2014.

61

Thank you so much

Any questions?

For object localization, OverFeat replaces the classifier layers with a regression network and trains it to predict bounding boxes at each spatial location and scale.

It then combines the regression predictions together, along with the classification results at each location.

OverFeat simultaneously runs the classifier and regressor networks across all locations and scales.

The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present in the corresponding field of view. So it is possible to assign a confidence to each bounding box.
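
The way a softmax output turns into a per-box confidence can be sketched as follows. The logits are made-up values for three spatial locations and three classes; in the real network they would come from the classifier applied at each location and scale.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def box_confidences(location_logits, class_index):
    """Confidence that class `class_index` is present at each location."""
    return [softmax(logits)[class_index] for logits in location_logits]


# Hypothetical classifier outputs at three locations, three classes each.
location_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
]
conf = box_confidences(location_logits, class_index=0)
print(conf)
```

Each location's bounding box prediction can then be paired with the confidence computed for the same location.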

63
