
Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective Filter-Level Fine-Tuning

Alessandro Bianchi
Student Id: 875035

Moreno Raimondo Vendra
Student Id: 877265

Supervisor: Prof. Marco Brambilla

Advisor: Prof. Pavlos Protopapas

Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano

This thesis is submitted for the Master of Science in Computer Science and Engineering

April 2019


To a beautiful friendship


Acknowledgements

To our supervisors Marco Brambilla and Pavlos Protopapas, thank you for your passion, dedication and professionalism. You led our learning experience with the DataShack program and this thesis, adding an enormous contribution to the quality of our studies and our personal growth.

To Marco Di Giovanni, for his contribution to this work and to our experience in Boston.

To all the other professors and staff of Politecnico di Milano, thank you for this enriching and impactful experience.

To the competent and warm Harvard IACS faculty, thank you for making us feel at home.

To our beloved families, for your love, generosity and all the sacrifices you made to support us.

To our friends, for always being there for us, in good and bad times.


Abstract

Computer vision tasks have recently seen great advancements thanks to the progress and widespread adoption of convolutional neural networks in various fields of application. To achieve such excellent results, these models are trained on very large datasets of pristine, high-quality images, and the training procedure often becomes a long and resource-intensive process. One consequence is the adoption of pre-trained models by whoever lacks the resources, data or know-how to train these models.

One issue with pre-trained models is that they are usually trained without accounting for distortions such as image blur and additive noise, which commonly occur during image acquisition, and that testing these models on distorted images causes a significant drop in performance. A well-known and commonly adopted solution to this problem is to fine-tune the network with distorted samples, but with larger networks, applying this procedure to all the parameters may become exceedingly costly.

In our thesis, we tackle exactly this problem by proposing a more efficient solution that is able to attain state-of-the-art performance at a lower computational cost. We start from the observation that in each layer of a convolutional neural network some filters are more susceptible to image distortion than others. We propose a metric to identify these filters and rank them, for each convolutional layer, based on the impact that such distortion has on them. Finally, we fine-tune only the most affected filters, significantly reducing the number of parameters to retrain.

The results of our work clearly demonstrate that the proposed technique recovers most of the performance lost due to input data distortion, making the retrained filters invariant to it, and outperforms the usual layer-level fine-tuning of the network when few noisy labeled samples are available, all at a noticeably lower computational cost.


Abstract

Computer vision has recently made great strides thanks to the progress and widespread adoption of convolutional neural networks in a variety of application domains. To achieve such excellent results, these models are trained on large datasets of uncontaminated, high-quality images, and the training procedure often becomes a long process that is highly demanding in terms of computational resources. One consequence is the adoption of pre-trained models by anyone who lacks the resources, data or know-how to train these models.

One problem with pre-trained models is that they are usually trained on data that ignores the fact that distortions such as blur and additive noise commonly occur during image acquisition, and that testing these models on such distorted images causes a significant drop in performance. A well-known and commonly adopted solution to this kind of problem is to fine-tune the already trained network with distorted samples. However, on extremely large networks, applying this procedure to all the parameters of the network can be excessively burdensome.

In our thesis we tackle exactly this problem by proposing a more efficient solution that is able to reach state-of-the-art performance at a considerably lower computational cost. We start from the observation that in each layer of a convolutional neural network some filters are more susceptible to image distortion than others. We then propose a metric to identify these filters and rank them, for each convolutional layer, based on the impact that such distortion has on them. Finally, we fine-tune only the filters most sensitive to noise, significantly reducing the number of parameters to retrain.

The results of our work clearly demonstrate that the proposed technique recovers most of the performance lost due to distortion in the input data, making the retrained filters insensitive to these impurities and outperforming the more common layer-level fine-tuning technique when few labeled samples are available, all at a considerably lower computational cost.


Contents

List of Figures

List of Tables

1 Introduction
    1.1 Context
    1.2 Problem Statement
    1.3 Proposed Solution
    1.4 Structure of the Thesis

2 Background
    2.1 Machine Learning
        2.1.1 Introduction
        2.1.2 Supervised and Unsupervised Machine Learning
        2.1.3 Models
        2.1.4 Overfitting and Model Selection
    2.2 Artificial Neural Networks
        2.2.1 Feedforward Neural Networks
        2.2.2 Training
    2.3 Regularization
        2.3.1 Parameter Norm Penalties
        2.3.2 Dropout
        2.3.3 Early Stopping
        2.3.4 Batch Normalization
        2.3.5 Dataset Augmentation and Noise
    2.4 Convolutional Neural Networks
        2.4.1 Convolutional Layers
        2.4.2 Pooling Layers
        2.4.3 Network Architectures
    2.5 Transfer Learning for Deep Learning
    2.6 Image Distortion

3 Related Work
    3.1 Impact of Image Distortion on Computer Vision Tasks
    3.2 Methods for Robust Image Classification

4 Methodology
    4.1 Model Training on Source Dataset
    4.2 Noise Impact Analysis
        4.2.1 Measuring Filters' Susceptibility to Input Data Distortion
        4.2.2 Ranking Filters by Susceptibility to Input Data Distortion
    4.3 Model Fine-Tuning on Target Dataset
        4.3.1 Activation Maps Swapping
        4.3.2 Selective Filter-Level Fine-Tuning
    4.4 Non Associative Method
        4.4.1 Finding Representative Images
        4.4.2 Ranking Filters by Susceptibility to Input Data Distortion

5 Implementation
    5.1 Source Code and Deployment
    5.2 Model Training on Source Dataset
    5.3 Noise Impact Analysis
    5.4 Model Fine-Tuning on Target Dataset
    5.5 Non Associative Method

6 Experiments
    6.1 Experimental Setting
        6.1.1 Datasets
        6.1.2 Distortions
        6.1.3 Network Architectures
        6.1.4 Clustering
    6.2 Results and Discussion
        6.2.1 Noise Impact on Baseline Models
        6.2.2 Activation Maps Swapping
        6.2.3 Filters Fine-Tuning on Target Dataset
        6.2.4 Non Associative Ranking

7 Conclusion

Bibliography

Appendix A Code Listings
    A.1 Baseline Models
        A.1.1 Simple Convolutional Net
        A.1.2 All Convolutional Net
    A.2 Noise Impact Analysis
        A.2.1 Earth Mover's Distance Computation
        A.2.2 Ranking Convolutional Filters per Layer with Borda Count
    A.3 Models for Selective Filter-Level Fine-Tuning
        A.3.1 Fine-Tunable Simple Convolutional Net
        A.3.2 Fine-Tunable All Convolutional Net
    A.4 K-Medoids Clustering


List of Figures

2.1 Artificial Neural Network with 3 layers: an input layer with 3 neurons, a hidden layer with 2 neurons, and a single output neuron.

2.2 Artificial neuron model. Each input $x_i$ is multiplied by a weight $w_i$ and their weighted sum is passed through the activation function $g(\cdot)$, together with the bias $-b$, to produce the output $y$.

2.3 Activation function plots: ReLU grows linearly for positive inputs; the sigmoid and tanh are both subject to saturation, but the sigmoid stays positive, while the hyperbolic tangent does not.

2.4 Dropout applied to a network with one hidden layer. In this case the weight mask is applied to both the input and hidden layers.

2.5 Convolution operation implemented by a 2D convolutional layer with kernel size of 2 and stride of 1 in both directions.

2.6 Max pooling operation with a 2 by 2 pooling window.

2.7 Convolutional Neural Network architecture diagram; each plane is a feature map.

4.1 Effect of image quality on DNN predictions, with predicted label and confidence generated by a pre-trained All-Conv Net [52] model. Distortion severity increases from left to right, with the left-most image in a row having no distortion (original). Green text indicates correct classification, while red denotes misclassification. Left: Examples from the CIFAR-100 test set distorted by Gaussian blur. Right: Examples from the CIFAR-100 test set distorted by Additive White Gaussian Noise (AWGN).


4.2 Two-dimensional t-SNE embedding of the last layer features of the All-Conv net, visualized for original (undistorted), blurred and noise affected images of 10 classes from the CIFAR-100 test set, with each color representing a separate class and distortion severity increasing from left to right. Each point in the embedding represents an image in the 10 class subset, with 100 images per class. Top row: Embedding for Gaussian blur affected images. Bottom row: Embedding for AWGN affected images.

4.3 Examples of associative and non associative pairs. Sharp and distorted pairs refer to the same images in the associative case. This does not hold for the non associative case, as the images in the pair are totally uncorrelated.

4.4 High level diagram of the associative method. The clean and noisy versions of the same image are fed into the same baseline model, producing two different sets of feature map activations, which are then compared using the EMD metric to generate an array of distances, where each element represents the scalar distance between each of the single feature map activations in the investigated layer.

4.5 Advantages of Wasserstein metric compared to Kullback–Leibler divergence.

4.6 High level diagram representing the feature maps swapping technique. A (clean, noisy) pair of the same image is needed, along with the ranking of the convolutional filters, for each layer, that orders them by their distortion susceptibility. Once the feature maps have been computed for both images, we proceed by swapping the distorted feature maps with their undistorted counterparts, according to the order imposed by the ranking.

6.1 Network architectures for our baseline models. Convolutional layers are parameterized by kxk-conv-d-s-p, where kxk is the spatial extent of the filter, d is the number of output filters in a layer, s represents the filter stride and p indicates the zero-padding added to both sides of the input. Max-pooling layers are parameterized as kxk-maxpool-s-p, where s is the spatial stride and p indicates the implicit zero padding to be added on both sides. Batch normalization layers are parameterized by d-bn, where d is the number of features in the layer. Dropout layers are parameterized by pr-dp, where pr is the dropout probability value. Fully connected linear layers are parameterized by d-fc, where d represents the dimensionality of the output space. (a) Simple convolutional network for CIFAR-10. (b) All-Conv Net for CIFAR-100.


6.2 Distortion susceptibility of convolutional filters in the first convolutional layer of both baseline models, when tested on training images respectively from CIFAR-10 and CIFAR-100. Even though the Borda counts are slightly different between the two types of distortions, it is clear how the ranking of the most susceptible filters tends to be independent of the type of distortion applied. In fact, considering only the 25% most sensitive feature maps of each depicted convolutional layer, the selected filters match 6 out of 8 times for the CIFAR-10 case and 19 out of 24 times for the CIFAR-100 case.

6.3 The two graphs represent the model accuracy when tested on distorted data (Additive White Gaussian Noise on top, Gaussian blur at the bottom) with an increasing number of feature maps swapped with the ones coming from the same undistorted image. The green line represents the model performance on clean images. As expected, when all 96 feature maps are swapped, the two lines meet.

6.4 Fine-tuning effects on the classification performance of noisy inputs, as a function of the number of training distorted images used to perform such fine-tuning. For each plot, three different configurations were considered: (1) most: fine-tuning is performed only on the 25% of the layer convolutional filters most susceptible to image distortion; (2) least: fine-tuning is performed only on the 25% of the layer convolutional filters least susceptible to image distortion; (3) all: fine-tuning is performed on all convolutional filters of the layer, independently from their susceptibility to image distortion. For the plots in the first two columns, fine-tuning was done on the baseline model trained on CIFAR-10 undistorted images, correcting the convolutional filters from both convolutional layers of the network. For the plots in the right-most two columns, fine-tuning was performed on the baseline model trained on CIFAR-100 pristine samples, correcting the convolutional filters only from the first three convolutional layers of the network. In the last row we present the fine-tuning performed with the least amount of distortion, which barely produces any effect.


6.5 Disparity between corresponding layer activations on sharp and noisy versions of an example image. Each heat-map represents the Hamming distance between binarized feature vectors (i.e., whether each channel is positive or zero) at corresponding locations in the sharp and Gaussian distorted inputs. We visualize these distance maps for the first three convolutional layers in the All-Conv Net architecture, comparing three different models: Top: Baseline, trained on undistorted images; Middle: Network where all the filters in the first three convolutional layers were fine-tuned with 5000 Gaussian distorted training images from CIFAR-100; Bottom: Network where only the top 25% of the convolutional filters most susceptible to Gaussian distortion in the first three convolutional layers were fine-tuned with 5000 Gaussian distorted training images from CIFAR-100. The numbers in round brackets indicate the element-wise sum of the corresponding heat-map: the higher the number, the larger the disparity between the matching layer activations on the sharp and noisy versions of the example image. We see that the model in which only the filters most susceptible to Gaussian distortion were fine-tuned produces feature activations that are relatively invariant to the presence of Gaussian noise in the input image.

6.6 Two-dimensional t-SNE embedding of the last layer features of the fine-tuned All-Conv net, visualized for original (undistorted), blurred and noise affected images of 10 classes from the CIFAR-100 test set, with each color representing a separate class and distortion severity increasing from left to right. Each point in the embedding represents an image in the 10 class subset, with 100 images per class. Top row: Embedding for Gaussian blur affected images. Bottom row: Embedding for AWGN affected images.


List of Tables

6.1 Comparison of the number of matching convolutional filters, per layer, between the associative ranking and several configurations of the non associative ones. All the non associative rankings, and the associative one used as "ground truth", are based on the All-Conv Net baseline model, trained on undistorted images from CIFAR-100. All three convolutional layers have 96 filters each, so the top 25% of each layer only considers 24 filters. The noisy images used to compare clean and noisy activations were perturbed with AWGN with distortion severity σ = 15 (comparable results are obtained when blurring distortion is in place). Pixels indicates that image pixels were used as features for the clustering method, whereas FTM indicates that the collapsed version of the baseline feature map activations, at the output of the corresponding convolutional layer, was adopted.


List of Listings

1 Simple Convolutional Net
2 All Convolutional Net
3 Earth Mover's Distance Computation
4 Ranking Convolutional Filters per Layer with Borda Count
5 Merging Activation Maps after the Split
6 Fine-Tunable Simple Convolutional Net
7 Fine-Tunable All Convolutional Net
8 K-Medoids Clustering


Chapter 1

Introduction

1.1 Context

In recent years, deep neural networks (DNNs) have become increasingly good at learning a very accurate mapping from inputs to outputs, from large amounts of labeled data, across several different applications [29, 54, 33, 21]. The ease of design for such networks, afforded by numerous open source deep learning libraries [4, 43], has established DNNs as the go-to solution for many computer vision applications. Even challenging computer vision tasks like image classification [51, 55] and object recognition [13, 47], which were previously considered to be extremely difficult, have seen great improvements in their state-of-the-art results due to the adoption of DNNs.

Increased computational power and the availability of large scale, carefully annotated datasets [7] are two of the most important factors that contributed to the success of such deep architectures in computer vision tasks. Nonetheless, since modern DNNs take 2-3 weeks to train across multiple GPUs on very large datasets [24], it is common to use pre-trained models as a starting point whenever a computer vision task needs to be addressed. A pre-trained model is a model that has usually been trained on a vast amount of data and that can be used as the initialization point for another model, to solve a task very similar to the original one, saving time and potentially achieving better performance than fitting a model from scratch on the target data [41].

A crucial aspect that is very often overlooked while designing DNN based computer vision systems, and that also affects such pre-trained models, is the visual quality of input images. In most realistic computer vision applications, an input image undergoes some form of image distortion, including blur and additive noise, during image acquisition, transmission or storage. Yet, most popular large scale datasets do not have images with such artifacts, and consequently, pre-trained models that have been trained on such images are not able to properly classify samples that present these distortions. This phenomenon goes by the name of dataset shift [37], and indicates every situation in which training and test data follow different probability distributions. As a result, the trained model cannot make reliable predictions on this new type of "noisy" data, because it is unable to generalize to situations that differ from those it encountered during training.

Obtaining accurately annotated data in this noisy context can be a tedious process and is often impractical. It would then be ideal to transfer the knowledge gathered by the model on undistorted data (source domain), to enable training of another model to properly classify noisy images (target domain), even when few labeled observations from this noisy setting are available.

1.2 Problem Statement

To provide a more formal definition of the problem, suppose we have a dataset $D_T$ of limited size, where $D_T = \{(x_1^T, y_1), \ldots, (x_n^T, y_n)\} \subseteq \mathcal{X}_T \times \mathcal{Y}$, in which $\mathcal{X}_T$ and $\mathcal{Y}$ respectively denote the domain of predictors $X_T$ and classes $Y$, while $n = |D_T|$. We know for a fact that this set of collected data points has undergone some form of image distortion, which can be modeled as a function $g(\cdot)$. $D_T$ can then be described as $D_T = \{(x_1^T, y_1), \ldots, (x_n^T, y_n)\} = \{(g(x_1^C), y_1), \ldots, (g(x_n^C), y_n)\}$, where $x_i^C$ represents the undistorted data point, sampled from a "clean" dataset $D_C = \{(x_1^C, y_1), \ldots, (x_n^C, y_n)\} \subseteq \mathcal{X}_C \times \mathcal{Y}$, in which $\mathcal{X}_C$ and $\mathcal{Y}$ respectively denote the domain of predictors $X_C$ and classes $Y$.
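As a concrete illustration of the relation between the clean and distorted datasets, the following minimal sketch builds a distorted dataset by applying a distortion function to every clean sample while leaving the labels untouched. It assumes additive white Gaussian noise as one possible choice of g(·); the helper names `awgn` and `make_distorted_dataset`, the toy 2x2 images and the pixel range [0, 255] are illustrative assumptions, not part of the thesis code.

```python
import random


def awgn(image, sigma, rng):
    """Hypothetical g(.): add zero-mean Gaussian noise, then clip to [0, 255]."""
    return [
        [min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def make_distorted_dataset(clean_dataset, sigma, seed=0):
    """Build D_T = {(g(x_i^C), y_i)} from D_C = {(x_i^C, y_i)}; labels are unchanged."""
    rng = random.Random(seed)
    return [(awgn(x, sigma, rng), y) for x, y in clean_dataset]


# Toy "clean" dataset: two 2x2 grayscale images with class labels (assumed values).
D_C = [
    ([[10.0, 20.0], [30.0, 40.0]], 0),
    ([[200.0, 210.0], [220.0, 230.0]], 1),
]
D_T = make_distorted_dataset(D_C, sigma=15.0)
```

Each distorted sample keeps the label of the clean sample it was derived from, which is what allows clean and distorted versions of the same image to be compared later on.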

The goal is to use the set of distorted data points $D_T$ to train a DNN to perform a generic classification task. If $D_T$ is not of sufficient size to train a DNN model from scratch (with random initialization), we may have to resort to pre-trained models. Focusing our attention on image classification settings, a pre-trained model is a convolutional neural network (CNN) which has been trained on a very large set of undistorted images $D_S = \{(x_1^S, y_1^S), \ldots, (x_m^S, y_m^S)\} \subseteq \mathcal{X}_S \times \mathcal{Y}_S$ (where $\mathcal{X}_S$ and $\mathcal{Y}_S$ respectively denote the domain of predictors $X_S$ and classes $Y_S$, while $m = |D_S|$) and which is known to perform very well in the task of classifying images.

Nevertheless, as shown in [8], testing distorted images with a pre-trained DNN model, even though such image distortions $g(\cdot)$ do not represent adversarial samples for the DNN, results in a considerable drop in classification performance. The reason for this degradation is attributed to the distortion function $g(\cdot)$, which is responsible for increasing the effects of the phenomenon known as covariate shift [37], defined as the case in which there is a change in the distribution of the input variables between training and testing data. More formally, assuming that the distribution of the undistorted images $P(X_C)$ is the same as the distribution of the images used to train the pre-trained model, we have that $P(X_C) = P(X_S)$. However, because of the distortion function $g(\cdot)$, the distribution of the "clean" images is different from the distribution of their distorted counterparts (i.e., $P(X_C) \neq P(X_T)$), which means, by substituting the previous equality, that $P(X_S) \neq P(X_T)$. This result indicates that features learned from a dataset of high quality images, such as $D_S$, are not invariant to image distortion or noise, and cannot be directly used for applications where the quality of the images ($D_T$) differs from that of the training images.
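The change in input distribution can be made tangible with a toy numerical sketch. It assumes, purely for illustration, that clean pixel intensities follow a Gaussian distribution and that the distortion is additive Gaussian noise without clipping; under these assumptions the variance of the distorted inputs is the clean variance plus the noise variance, so the two input distributions cannot coincide. The numeric values are arbitrary choices, not taken from the thesis.

```python
import random
import statistics

rng = random.Random(0)

# Toy stand-in for intensities drawn from P(X_C): Gaussian, mean 0.5, std 0.1 (assumed).
clean = [rng.gauss(0.5, 0.1) for _ in range(20000)]

# g(.) modelled as additive Gaussian noise with std 0.2 (assumed, no clipping).
noisy = [x + rng.gauss(0.0, 0.2) for x in clean]

var_clean = statistics.pvariance(clean)
var_noisy = statistics.pvariance(noisy)

# Independent noise adds its variance: Var[g(X_C)] = Var[X_C] + 0.2 ** 2.
print(f"clean variance: {var_clean:.4f}, distorted variance: {var_noisy:.4f}")
```

The labels play no role in this computation: only the distribution of the inputs moves, which is exactly the covariate shift setting.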

The goal of our work is then to present an approach that reduces the effect of the covariate shift caused by the distortion function $g(\cdot)$, and eventually yields a model that is robust to this phenomenon. The approach leverages the proposed transfer learning technique while keeping the number of parameters to train within a feasible range, given the limited amount of training samples.

1.3 Proposed Solution

Through our study, we try to address this problem by acting directly on the features learned by the model on the source domain, so that they become invariant to image distortion, ideally obtaining features such that XS ≈ XT.

As in [3], we show that among all the filters of the convolutional layers of a DNN, some filters are more susceptible to input distortions than others, and that correcting the activations of these filters can help recover the lost performance. However, instead of correcting the activations, as in [3], we act directly on the filters, so that the learned features become nearly invariant to image distortion.

Leveraging this finding, we propose a novel technique to rank the filters that are most affected by input data distortion, through an appropriate distance metric and voting technique that we will detail in Chapter 4.

We will then present a detailed methodology to act directly on a subset of the aforementioned filters per layer, so that the output activations become robust to image distortion, preventing the considerable degradation in classification performance that would normally take place if the DNN were only trained on undistorted inputs.

Unlike the usual fine-tuning of pre-trained models, which relies on retraining all the filters in some convolutional layers of the network, we show that fine-tuning only a subset of the most affected filters of those layers achieves a better overall performance and a lower training cost than retraining all the filters of those layers, when data in the target domain is limited.


1.4 Structure of the Thesis

The structure of the thesis is organized as follows:

• Background exposes some of the necessary tools and concepts that are present in the thesis;

• Related Work provides an overview of other studies in the literature that addressed the same problem that we have covered;

• Methodology presents a detailed description of the entire pipeline that we developed to solve the problem;

• Implementation reports the major technologies and tools of the system implemented to validate our proposed approach;

• Experiments is the chapter devoted to the presentation of all the results and discussions of the experiments conducted to validate the technique;

• Conclusion wraps up the discussion with concluding remarks and pointers for potential future work.


Chapter 2

Background

In recent years the interest around deep learning has grown massively. Although the technique has a long and rich history, dating back to the 1940s, the recent increase in the amount of available training data and the improved computing infrastructures have enlarged the scope of its possible applications. Framing this approach in the broader machine learning field is the objective of the following sections, which will also provide all the foundations necessary for the reader to fully understand the concepts presented in this thesis.

We start by introducing the machine learning field, the tasks it solves and the methods it uses. We then approach deep learning, describing its main characteristics and some model exemplars. Finally, we present transfer learning and its benefits when applied to deep learning models.

2.1 Machine Learning

2.1.1 Introduction

Deep learning is a subset of machine learning. Having a comprehensive understanding of the basic principles of the latter is a necessary step to better frame and understand the former. Machine learning provides a set of automated methods of data analysis capable of performing a given task, for example predicting new data points, by detecting and reusing patterns in data. Machine learning methods exploit the concept of learning from data to solve a given task; in [36] we find a more formal definition of a learning algorithm:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."


In the following section, we will present some basic concepts in machine learning.

2.1.2 Supervised and Unsupervised Machine Learning

We call supervised machine learning the set of techniques whose goal is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs $D = \{(x_i, y_i)\}_{i=1}^{N}$. We call D the training set, while N corresponds to the number of training samples. Usually each training input $x_i$ is a vector of numbers. These are called features, attributes or covariates. In general, however, $x_i$ could be an arbitrarily complex structured object. Similarly, the form of the output or response variable can in principle be anything, but most methods assume that $y_i$ is a categorical variable from some finite set, $y_i \in \{1, \dots, C\}$, or that $y_i$ is a real-valued scalar. In the first case, the problem is known as classification or pattern recognition, and in the latter the problem is known as regression.

On the other side of the spectrum we find unsupervised learning: it comprises problems in which we only have inputs, $D = \{x_i\}_{i=1}^{N}$, and we want to find interesting patterns in the data. This is a much less well-defined problem, which opens up to a broad variety of techniques, since we are not told what kinds of patterns to look for and there is no obvious error metric to use.

2.1.3 Models

Depending on the problem, different kinds of probabilistic models can be used, more specifically of the form p(y|x) for supervised learning or p(x) for unsupervised learning. Among the many ways to describe these models, the most basic one is to distinguish them in two categories: parametric and non-parametric models. Parametric models are faster to use, but they require assumptions on the nature of the data distributions in order to work. Non-parametric models, on the other hand, are more flexible but quickly become intractable due to high computational requirements for large datasets. An example of a non-parametric model is the K-nearest neighbors classifier: to classify a new sample, it counts the occurrences of each class label among the K closest points in the training set and returns the resulting empirical fractions as class estimates for the new sample. It does not need any information on the data distribution, but for high dimensional samples and large amounts of data computing distances between samples becomes very demanding.
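The K-nearest neighbors procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `knn_predict_proba`, the Euclidean distance and the toy data are assumptions, not part of the thesis.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x_new, k, n_classes):
    """Return the empirical class fractions among the k nearest training points."""
    # Euclidean distance from the new sample to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count label occurrences per class and normalise them to fractions.
    counts = np.bincount(y_train[nearest], minlength=n_classes)
    return counts / k

# Toy data (made up for illustration): two well-separated 2-D clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

proba = knn_predict_proba(X_train, y_train, np.array([0.15, 0.15]), k=3, n_classes=2)
predicted_class = int(np.argmax(proba))
```

Every prediction requires a distance to each training sample, which is the cost that becomes demanding for large, high dimensional datasets.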


2.1.4 Overfitting and Model Selection

When fitting our model, either because we have too few training samples or because the model is overly capable for the task, we may end up capturing every minor variation in the data; this phenomenon is called overfitting, and it has to be avoided since the model would likely reproduce the noise present in the training data rather than the true distribution. Overfitting leads to poor generalization performance, which measures how well the model does with new samples. For example, if we fit the K-Nearest Neighbors classifier with K=1, the model makes no errors on the training data, but it will very likely not perform well on unseen samples, as the decision surface will be very irregular. How can we select a good value for K so that we avoid overfitting? This problem, in which we have to choose between models with different degrees of flexibility, is called model selection, and in the literature we can find multiple techniques that aim at solving it. In order to assess whether or not our model is overfitting we need to test its performance, and we can do so by computing the misclassification rate, which is defined as follows:

$$\mathrm{err}(f, D) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(f(x_i) \neq y_i) \tag{2.1}$$

where f is the model, D is the training set, N is the number of training samples and I(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise. As we previously mentioned, the model’s performance on training data is not always a good indicator of generalization performance, so the only way we have to avoid selecting an overfitting model is to test it on unseen data. The misclassification rate of a model on a large independent test set is a good approximation of its generalization performance, but in general we do not have access to a large set of future data; this means that we need to split the data we have into two sets: one part used for training the model, and a second part, called the validation set, used for selecting the model complexity. In cases in which the number of training cases is too small to perform this partition, it is common to use a technique called cross validation (CV). In CV we split the training data into K folds; then, for each fold $k \in \{1, \dots, K\}$, we train on all the folds but the k’th, and test on the k’th, in a round-robin fashion. We then compute the error averaged over all the folds, and use this as a proxy for the test error.
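As a minimal sketch of equation 2.1 and of the cross validation procedure, the following NumPy code estimates the error of a 1-nearest-neighbor classifier. The helper names, the random fold assignment and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def misclassification_rate(predict, X, y):
    """Fraction of samples for which predict(x_i) != y_i, as in equation 2.1."""
    predictions = np.array([predict(x) for x in X])
    return float(np.mean(predictions != y))

def one_nn_fit(X_train, y_train):
    """Return a 1-nearest-neighbor predictor built on the given training data."""
    def predict(x):
        return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]
    return predict

def cross_validation_error(fit, X, y, n_folds, seed=0):
    """Average held-out misclassification rate over n_folds folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        # Train on every fold except the k-th one, then test on the k-th.
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        predict = fit(X[train_idx], y[train_idx])
        errors.append(misclassification_rate(predict, X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# Synthetic data (an assumption): two separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

train_error = misclassification_rate(one_nn_fit(X, y), X, y)
cv_error = cross_validation_error(one_nn_fit, X, y, n_folds=5)
```

The training error of the 1-NN model is zero by construction, which is why only the held-out estimate `cv_error` is informative for model selection.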

2.2 Artificial Neural Networks

Deep learning provides a powerful framework for machine learning tasks. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. By leveraging different variants of its quintessential model architecture (i.e. the Neural Network), deep learning provides a powerful set of tools to solve many supervised and unsupervised learning tasks.

2.2.1 Feedforward Neural Networks

The Feedforward Neural Network is the fundamental deep learning model; its goal is to approximate a function f that maps inputs x to outputs y. A feedforward neural network defines a mapping y = f(x, θ) and learns the parameters θ that result in the best function approximation. It is composed of multiple units, called neurons, organized in layers and connected to form an acyclic structure. If the structure contains feedback connections, then the model is called a Recurrent Neural Network. The first layer of a feedforward neural network is called the input layer, the last one is called the output layer, while all the layers in between them are referred to as hidden layers.

Figure 2.1 Artificial Neural Network with 3 layers: an input layer with 3 neurons (x1, x2, x3), a hidden layer with 2 neurons (h1, h2), and a single output neuron (y).

These models are called artificial neural networks because of the similarities they share with the biological structure of the brain. The units, or neurons, that constitute Artificial Neural Networks in fact loosely model the human neuron architecture. The human neuron is composed of dendrites, an axon, synapses and the cell body:

• Dendrites collect input charges from synapses (either inhibitory or excitatory synapses, with different weights)

• When the charge level accumulated from the dendrites is above a threshold the neuron "fires"

• The axon propagates the neuron’s output to its synapses

In the same way, in the artificial neuron model, each element of the input vector is multiplied by a weight, the results are summed up, and when the weighted sum of the inputs is above a given threshold called "bias", the neuron’s activation function produces a positive output. More formally, given an input vector x, the artificial neuron output y is given by:

$$y = g\left(\sum_{i=1}^{I} x_i w_i - b\right) \tag{2.2}$$

where $w_i$ is the weight for the i-th component of the input vector, b is the bias or threshold, and g is the neuron’s activation function. These concepts date back to the 1950s, and were first introduced by Rosenblatt’s Perceptron [48], which enabled the training of a single neuron.

Figure 2.2 Artificial neuron model. Each input xi is multiplied by a weight wi and their weighted sum, together with the bias −b, is passed through the activation function g(·) to produce the output y.
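Equation 2.2 can be illustrated with a single neuron using a step activation. The weights and threshold below are hypothetical values, chosen so that the neuron behaves like a logical AND of two binary inputs.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 when z is positive, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

def neuron_output(x, w, b, g=step):
    """Compute y = g(sum_i x_i * w_i - b), as in equation 2.2."""
    return g(np.dot(x, w) - b)

# Hypothetical parameters: the neuron fires only when both inputs are 1.
w = np.array([1.0, 1.0])
b = 1.5
outputs = [float(neuron_output(np.array(x, dtype=float), w, b))
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```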

The activation function g is a very important piece of the puzzle: it is usually referred to as "nonlinearity", and in fact, it allows each layer to learn a non-linear function of its inputs. Since the network is composed of stacked layers, the output of each layer becomes the input of the following one, and thanks to this structure, what we obtain after training a neural network is a non-linear function of its inputs that is the composition of all the non-linear functions learned by its layers. More formally, given an input x and the function f learned by a feedforward neural network with J layers, we have that:

$$f(x) = f^{J}(f^{J-1}(\dots f^{0}(x))) \tag{2.3}$$

where $f^{j}$ represents the non-linear function learned by layer j, and $f^{J}$ and $f^{0}$ are the functions learned by the output and input layers respectively.
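The composition in equation 2.3 can be sketched as a chain of layer functions. The layout mirrors the 3-2-1 network of Figure 2.1, but the weights are arbitrary values and the helper names `make_layer` and `compose` are assumptions introduced for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def make_layer(W, b, g):
    """Return the function x -> g(W x + b) computed by one layer."""
    return lambda x: g(W @ x + b)

def compose(layers):
    """Return f(x) = f^J(f^{J-1}(... f^0(x))), applying layers in order."""
    def f(x):
        for layer in layers:
            x = layer(x)
        return x
    return f

# Arbitrary illustrative weights: 3 inputs, 2 hidden units, 1 output.
W1 = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]])
b1 = np.array([0.0, 0.5])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.1])

network = compose([identity,                      # input layer f^0
                   make_layer(W1, b1, relu),      # hidden layer
                   make_layer(W2, b2, identity)]) # output layer
y = network(np.array([1.0, 2.0, 3.0]))
```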


Since, like many other models in machine learning, neural networks are trained using a gradient-based approach, the neuron activation function needs to be continuous and differentiable, at least almost everywhere (the ReLU, for instance, is not differentiable at zero, where a one-sided derivative is used in practice). There are not yet many definitive guiding theoretical principles on how to choose the right activation function, but the most used nowadays is the Rectified Linear Unit (2.4)

$$g(z) = \mathrm{ReLU}(z) = \max\{0, z\} \tag{2.4}$$

because of its ability to maintain large and consistent gradients during training [14]. Prior to the ReLU the most used activation functions were the sigmoid function

$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2.5}$$

and the hyperbolic tangent function

$$g(z) = \tanh(z) = 2\sigma(2z) - 1 \tag{2.6}$$

These two are tightly related, as shown in (2.6). Sigmoidal units saturate across most of their domain: they saturate to a high value when the input is very positive, to a low value when the input is very negative, and are only strongly sensitive to their input near zero. This fact can make gradient-based learning very difficult. The hyperbolic tangent, on the other hand, is simpler to train because of its closer resemblance to the identity function near the origin. We will explain this more in depth in the following section.

Figure 2.3 Activation functions plots: ReLU grows linearly for positive inputs, the sigmoid and tanh are both subject to saturation, but the sigmoid stays positive, while the hyperbolic tangent does not.
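The three activation functions, the identity in equation 2.6 and the saturation of the sigmoid can be checked numerically with a short NumPy sketch. The evaluation points are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_via_sigmoid(z):
    """Hyperbolic tangent written as 2*sigmoid(2z) - 1, as in equation 2.6."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6.0, 6.0, 13)
identity_holds = bool(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))

# The sigmoid gradient peaks at zero and almost vanishes in the saturated region.
grad_at_zero = sigmoid_grad(0.0)
grad_at_six = sigmoid_grad(6.0)
```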


2.2.2 Training

Neural networks, like many other machine learning models, are known to be very successful because of their capability to learn from data. As mentioned at the beginning of the chapter, a model learns if its performance in a given task improves with experience; in this case, if it improves with more data. Because of its structure, given an input sample, the output of a neural network depends on the value of its weights, so in order for it to produce the desired output, its weights need to be adjusted accordingly.

Learning in artificial neural networks is accomplished by means of gradient-based learning, a framework that will be described in the following paragraphs, in association with an algorithm called backpropagation, which computes the required gradients. In order to learn, very much like humans, a neural network needs to know whether it is performing well or not: it needs a performance metric. In machine learning, this metric is usually referred to as the cost function or loss, which is a function of the model parameters and measures how close the behavior of the network is to the desired one. Using gradient-based learning and backpropagation to minimize the cost function with respect to the model parameters leads us toward a configuration of the model that solves the problem formalized by the loss.

Because of the neural network nonlinearity, the loss function most often becomes non-convex, making its optimization a difficult task. For this reason, gradient-based iterative optimization techniques such as Stochastic Gradient Descent proved to be good solutions for the problem. The main idea behind gradient-based algorithms, introduced by A. Cauchy in 1847, is to compute the gradient of the loss function with respect to the model parameters; the gradient points in the direction of steepest increase of the function, and since we want to minimize it, it tells us in which direction we need to move to do so. This means that updating the model parameters by a sufficiently small step in the direction opposite to the one pointed by the gradient will result in a reduction of the loss function value, which implies an improvement in the model performance.

Due to their usually very large number of parameters, deep learning models generally require huge volumes of data in order to generalize well and accomplish the task they are being trained for. In order to compute the model performance, which is a required step in the learning procedure, we need to apply the loss function to the entire dataset. This step becomes extremely computationally intensive, and since it has to be done at each iteration of the gradient descent algorithm in order to compute the updates, it makes the algorithm infeasible for deep learning models. To solve this issue, instead of computing the loss gradient on the entire dataset, we can compute it on each item of a small randomized batch of samples, and then average the results to approximate the actual gradient. This variant of the algorithm, called Stochastic Gradient Descent, is the most widespread training algorithm for deep learning models.
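A minimal sketch of minibatch Stochastic Gradient Descent is given here on a deliberately simple model, a noiseless linear regression with squared loss, rather than on a neural network. The function name, the hyperparameter values and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.1, batch_size=8, n_epochs=200, seed=0):
    """Fit w by minibatch SGD on the loss 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Visit the samples in a new random order at every epoch.
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            # Average of the per-sample gradients over the minibatch.
            grad = X[batch].T @ residual / len(batch)
            # Move against the gradient to decrease the loss.
            w -= learning_rate * grad
    return w

# Synthetic noiseless data generated from known weights (an assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w_hat = sgd_linear_regression(X, y)
```

Each update only touches `batch_size` samples, so its cost does not grow with the size of the dataset.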


The only missing component of the training process is the one that allows for the computation of the gradient, which is a vector containing the partial derivatives of the cost function with respect to all the network weights and biases. Computing such derivatives may seem an extremely complex problem because of the many connections and dependencies among the various parameters; luckily, thanks to an algorithm called backpropagation, this task can be accomplished efficiently by reusing a lot of computation. Here we briefly explain how it works. The algorithm revolves around these four equations:

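$$\delta^{L} = \nabla_a C \odot \sigma'(z^{L}) \tag{2.7}$$

$$\delta^{l} = ((w^{l+1})^{T} \delta^{l+1}) \odot \sigma'(z^{l}) \tag{2.8}$$

$$\frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j} \tag{2.9}$$

$$\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k} \delta^{l}_{j} \tag{2.10}$$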

The main concept of the algorithm is that the derivative of the loss with respect to the parameters in layer l depends on a quantity $\delta^{l}$ called error, which can be expressed as a function of the error $\delta^{l+1}$ in layer l + 1, as shown in equation 2.8. After computing the neuron activation $a^{L}$ at the output layer L with the forward pass, equation 2.7 tells us how to compute the error for the last layer. With this piece of information, we can now go ahead and compute the errors for every layer, starting from the output one and propagating the results backward with 2.8. Once we have the value of the error for every layer, we can use 2.9 and 2.10 to compute the derivatives of the loss C with respect to the biases $b^{l}_{j}$ and the weights $w^{l}_{jk}$ respectively.
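The four equations can be sketched for a small fully connected network and checked against a finite-difference estimate. The sketch assumes sigmoid activations and a quadratic cost C = 0.5 * ||a^L - y||^2 (so that the gradient of C with respect to a^L is a^L - y); the layer sizes and random parameters are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Quadratic cost of the network output for one sample."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    """Return the gradients of the cost with respect to all weights and biases."""
    # Forward pass, storing pre-activations z and activations a of every layer.
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Equation 2.7: error at the output layer.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_b[l] = delta                            # equation 2.9
        grads_w[l] = np.outer(delta, activations[l])  # equation 2.10
        if l > 0:
            # Equation 2.8: propagate the error one layer backward.
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_w, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]
x, y = rng.normal(size=3), np.array([0.0, 1.0])

grads_w, grads_b = backprop(weights, biases, x, y)

# Central finite difference on a single weight as an independent check.
eps = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][1, 2] += eps
w_minus[0][1, 2] -= eps
numerical = (cost(w_plus, biases, x, y) - cost(w_minus, biases, x, y)) / (2 * eps)
```

The backward loop reuses each `delta` to obtain the previous one, which is the reuse of computation that makes the algorithm efficient.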

Plugging backpropagation into the Stochastic Gradient Descent algorithm, we can efficiently compute the parameter updates that minimize the loss function in an iterative fashion. Given a batch of m samples, we can compute the updates with the following equations:

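$$w^{l} \rightarrow w^{l} - \frac{\eta}{m} \sum_{x} \delta^{x,l} (a^{x,l-1})^{T} \tag{2.11}$$

$$b^{l} \rightarrow b^{l} - \frac{\eta}{m} \sum_{x} \delta^{x,l} \tag{2.12}$$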

where η represents the learning rate, which regulates how fast we want the network to learn.


2.3 Regularization

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs; in other words, how to avoid overfitting and improve generalization performance. As we mentioned in 2.1.4, generalization performance can be measured on a large independent set of samples, usually referred to as the test set, that has not been used during the training phase. The strategies utilized in machine learning that aim to reduce the error on the test set are called regularization techniques.

2.3.1 Parameter Norm Penalties

Many regularization approaches are based on the intuition that an overly capable model will very likely overfit; in order to avoid this, model complexity is penalized during model fitting by adding a parameter norm penalty to the loss function. We can then rewrite the loss function as a function of the input X, the output y and the parameters θ:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) \tag{2.13}$$

where Ω is the norm penalty and $\alpha \in [0, \infty)$ weights the contribution of the penalty term. Minimizing $\tilde{J}$, the training algorithm will decrease the training error while keeping a low parameter norm, thus reducing the model capacity. Among the different choices of Ω that can be made, the most common is the L2 norm penalty, also known as weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term $\frac{1}{2} \|w\|_2^2$ to the objective function.
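The effect of the L2 penalty can be seen on a toy objective. In this sketch the unregularized loss is assumed to be 0.5 * ||w - target||^2, whose minimizer is `target`; with the penalty added, the minimizer shrinks to target / (1 + alpha). All names and values are illustrative assumptions.

```python
import numpy as np

def l2_penalty(w):
    """Omega(w) = 0.5 * ||w||_2^2."""
    return 0.5 * np.sum(w ** 2)

def regularized_loss(loss, w, alpha):
    """Regularized objective of equation 2.13: loss(w) + alpha * Omega(w)."""
    return loss(w) + alpha * l2_penalty(w)

def regularized_gradient(grad_loss, w, alpha):
    """Gradient of the regularized objective; the gradient of Omega(w) is w itself."""
    return grad_loss(w) + alpha * w

# Toy unregularized loss (an assumption) with a known minimizer.
target = np.array([2.0, -4.0])
loss = lambda w: 0.5 * np.sum((w - target) ** 2)
grad_loss = lambda w: w - target

def minimize(alpha, learning_rate=0.1, n_steps=500):
    """Plain gradient descent on the regularized objective."""
    w = np.zeros_like(target)
    for _ in range(n_steps):
        w = w - learning_rate * regularized_gradient(grad_loss, w, alpha)
    return w

w_plain = minimize(alpha=0.0)    # converges to target
w_decayed = minimize(alpha=1.0)  # converges to target / 2, closer to the origin
```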

2.3.2 Dropout

Dropout ([53]) is an efficient and powerful regularization technique that can be applied to a variety of models. It is based on the idea that averaging the predictions of many models produces better results both in terms of performance and generalization. Training and evaluating many neural networks may be impractical because these operations are rather costly both in terms of runtime and memory. In order to make this technique feasible, dropout approximates this process by training the ensemble consisting of all subnetworks that can be formed by removing non-output units from an underlying base network. This means that for each minibatch of samples in the learning algorithm, we independently sample a binary mask, drawn from a uniform distribution, to apply to the network units. Then we multiply the output of each neuron by the value of the mask, effectively "turning off" some neurons and "leaving on" some others. This way, for each minibatch we train a different network. The prediction is then computed by sampling some masks from the same distribution used during training and averaging the network outputs.

Figure 2.4 Dropout applied to a network with one hidden layer, showing the base network and its subnetworks. In this case the weight mask is applied to both the input and hidden layers.
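The mask sampling and the prediction by mask averaging described above can be sketched as follows for a network with one hidden layer. The keep probability, the thresholding of uniform random numbers to obtain the binary mask and the function names are assumptions; in practice, prediction is often approximated with a single forward pass using rescaled weights rather than explicit averaging.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_with_dropout(x, W1, W2, keep_prob, rng):
    """One forward pass through a randomly sampled subnetwork."""
    # Binary masks obtained by thresholding uniform random numbers.
    input_mask = rng.random(x.shape) < keep_prob
    h = relu(W1 @ (x * input_mask))
    hidden_mask = rng.random(h.shape) < keep_prob
    # Masked units contribute nothing to the output.
    return W2 @ (h * hidden_mask)

def predict_by_mask_averaging(x, W1, W2, keep_prob, n_masks, rng):
    """Average the outputs of n_masks randomly sampled subnetworks."""
    outputs = [forward_with_dropout(x, W1, W2, keep_prob, rng) for _ in range(n_masks)]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # arbitrary illustrative weights
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

prediction = predict_by_mask_averaging(x, W1, W2, keep_prob=0.8, n_masks=100, rng=rng)
```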

2.3.3 Early Stopping

Early stopping is a very common learning strategy that inherently improves the generalization performance of deep learning models. Usually, in Stochastic Gradient Descent, the number of iterations over which the algorithm runs is a predefined quantity. By observing the training and validation error curves we can notice that after a certain number of epochs, the model stops improving its validation performance, and only improves in training performance instead. This is a clear sign of overfitting. Given this observation, the best performing model is the one with the lowest validation error, and early stopping aims at finding exactly that. Instead of running the learning algorithm for a fixed number of epochs, early stopping halts training when no configuration has improved on the best recorded validation error for some pre-specified number of iterations.
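The stopping rule can be sketched independently of any particular model. In this illustration the validation errors are given as a precomputed list (made-up values), whereas in real training each value would come from evaluating the model after an epoch; the name `patience` for the pre-specified number of iterations is an assumption.

```python
def early_stopping(validation_errors, patience):
    """Return (best_epoch, best_error, stopped_epoch) for a validation curve.

    Training stops once `patience` epochs have passed without improving
    on the best validation error recorded so far.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # New best configuration: remember it and keep training.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, best_error, epoch
    return best_epoch, best_error, len(validation_errors) - 1

# Made-up validation curve: improves until epoch 3, then degrades.
curve = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.35, 0.37, 0.40]
result = early_stopping(curve, patience=3)
```

Here training would stop at epoch 6 and the parameters saved at epoch 3 would be kept.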

2.3.4 Batch Normalization

Batch normalization is a method of adaptive reparametrization that proved itself as a great way to ease and speed up neural network training; it also has some regularization properties. The batch norm reparametrization greatly reduces the problem of coordinating updates across many layers, and to do so it normalizes the mean and standard deviation of each layer's activations. Given H, a minibatch of activations in layer l where each row corresponds to the activations of an example, we compute the normalized H′ as:

$$H' = \frac{H - \mu}{\sigma}$$

$$\mu = \frac{1}{m} \sum_{i} H_{i}$$

$$\sigma = \sqrt{\delta + \frac{1}{m} \sum_{i} (H - \mu)_{i}^{2}}$$

where µ is a vector containing the mean of each unit over the samples in the minibatch, σ is a vector containing the standard deviation of each unit over the samples in the minibatch, and δ is a small positive constant that avoids taking the square root of zero. The rest of the network then operates on H′ in exactly the same way that the original network operated on H. Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. To maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH′ + β rather than simply the normalized H′. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation.
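The normalization and the learned rescaling γH′ + β can be sketched directly from the three formulas. Only the training-time computation on one minibatch is shown; the value of δ and the input data are assumptions.

```python
import numpy as np

def batch_norm(H, gamma, beta, delta=1e-8):
    """Normalize each column (unit) of the minibatch H, then rescale and shift."""
    mu = H.mean(axis=0)                                    # per-unit mean
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit std
    H_norm = (H - mu) / sigma
    return gamma * H_norm + beta

# Illustrative minibatch: 64 examples, 4 units, far from zero mean and unit std.
rng = np.random.default_rng(0)
H = rng.normal(5.0, 3.0, size=(64, 4))

normalized = batch_norm(H, gamma=np.ones(4), beta=np.zeros(4))
rescaled = batch_norm(H, gamma=np.full(4, 2.0), beta=np.full(4, 3.0))
```

With γ = 1 and β = 0 every unit ends up with mean 0 and standard deviation 1 over the minibatch; other values of γ and β set them to β and γ respectively.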

2.3.5 Dataset Augmentation and Noise

The most basic and most effective way to improve generalization is to train on more data; of course in practice the amount of data available is limited, but one simple way to get around this is to generate new samples starting from the ones we have. This practice, called data augmentation, cannot be applied in every setting, but it proved successful, for example, in object recognition problems, where augmenting the image dataset is just a matter of adding distorted samples and duplicating the respective labels. One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between "b" and "d" and the difference between "6" and "9", so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks. Another form of data augmentation that was found to improve the generalization of artificial neural networks is input noise injection ([45]). This technique improves the robustness of the network, and it tends to shrink the parameter norm; moreover, when applied to hidden units, it has effects similar to those of dropout. Indeed, dropout is the main development of the noise injection regularization approach.
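As a small illustration of augmentation by input noise injection, the sketch below appends noisy copies of each image and duplicates the labels. The Gaussian noise model, the clipping to [0, 1] and the function name are assumptions made for this example, not a description of the distortions used in this thesis.

```python
import numpy as np

def augment_with_gaussian_noise(images, labels, noise_std, n_copies, seed=0):
    """Return the original images plus n_copies noisy versions of each, with labels."""
    rng = np.random.default_rng(seed)
    noisy = [np.clip(images + rng.normal(0.0, noise_std, size=images.shape), 0.0, 1.0)
             for _ in range(n_copies)]
    # Noise does not change the class, so the labels are simply repeated.
    aug_images = np.concatenate([images] + noisy, axis=0)
    aug_labels = np.concatenate([labels] * (n_copies + 1), axis=0)
    return aug_images, aug_labels

# Illustrative data: four random 8x8 "images" with pixel values in [0, 1).
images = np.random.default_rng(1).random((4, 8, 8))
labels = np.array([0, 1, 1, 0])

aug_images, aug_labels = augment_with_gaussian_noise(images, labels,
                                                     noise_std=0.05, n_copies=2)
```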


2.4 Convolutional Neural Networks

Convolutional neural networks ([31]) are a specialized kind of neural network capable of processing data known to have a grid-like topology. As described in [14]:

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

2.4.1 Convolutional Layers

The convolution is an operation on two functions of a real-valued argument described by the following expression:

$$s(t) = \int x(a)\, w(t - a)\, da$$

Here we also provide a discrete form of the same expression, which will be more useful for our application setting:

$$s(t) = \sum_{a=-\infty}^{+\infty} x(a)\, w(t - a)$$

In convolutional neural networks, x is referred to as the input, w as the kernel and the output as the feature map. As shown in Figure 2.5, a window of weights represents the kernel. This kernel slides over the input and, at each location, it gets multiplied element-wise with the underlying input elements. The sum of the multiplication results is the result of the convolution at that location; the size of the step made by the window is referred to as the stride.

Input (4 × 4), rows: [a b c d], [e f g h], [i j k l], [m n o p]

Kernel (2 × 2), rows: [w x], [y z]

Output (3 × 3), rows: [aw+bx+ey+fz, bw+cx+fy+gz, cw+dx+gy+hz], [ew+fx+iy+jz, fw+gx+jy+kz, gw+hx+ky+lz], [iw+jx+my+nz, jw+kx+ny+oz, kw+lx+oy+pz]

Figure 2.5 Convolution operation implemented by a 2D convolutional layer with kernel size of 2 and stride of 1 in both directions.
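The sliding-window computation drawn in Figure 2.5 can be sketched in NumPy as follows. Like the figure, the sketch does not flip the kernel (strictly speaking it is a cross-correlation), which is how convolutional layers are commonly implemented in deep learning libraries. The function name `conv2d_valid` and the toy values are assumptions.

```python
import numpy as np

def conv2d_valid(inp, kernel, stride=1):
    """Slide the kernel over the input without padding and without flipping it."""
    kh, kw = kernel.shape
    out_h = (inp.shape[0] - kh) // stride + 1
    out_w = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Window of the input currently covered by the kernel.
            window = inp[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Element-wise product followed by a sum, as in Figure 2.5.
            out[i, j] = np.sum(window * kernel)
    return out

# Toy 4x4 input with values 0..15 and a 2x2 kernel (arbitrary choices).
inp = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

feature_map = conv2d_valid(inp, kernel)            # 3x3 output, stride 1
strided_map = conv2d_valid(inp, kernel, stride=2)  # 2x2 output, stride 2
```

With this kernel each output equals the top-left element of the window minus its bottom-right element, which is -5 everywhere for this input.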


There are three main ideas underlying the success of convolutional neural networks: sparse interactions, parameter sharing, and equivariant representations.

Sparse interactions Since the kernels used in convolutional neural networks are smaller than the input they receive, a convolutional layer has far fewer parameters than a fully connected layer, in which each input point is connected to its own weight.

Parameter sharing This refers to using the same parameter for more than one function in a model. In a fully connected layer, each parameter is used once to compute the output of the layer; in a convolutional layer, instead, each parameter of the kernel is applied at every position of the input, except possibly some boundary values depending on the layer’s architecture. So, by construction, convolutional layers require far fewer parameters than fully connected ones, making the processing of large inputs a feasible task.

Equivariance This property is due to the architecture of convolutional layers and the operation they implement. A function is said to be equivariant when a modification in the input is reflected in the same way in the output. The operation implemented by convolutional layers is equivariant to translations in the input, and this is very important, for example, for image classification, because it means that convolutional kernels will be able to detect patterns in the input independently of the pattern position in the picture. Moreover, if the pattern in the input changes its position, its representation will move by the same amount in the output, preserving feature locality.
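Translation equivariance can be verified numerically on a one-dimensional signal: convolving a shifted input gives the shifted version of the original response. The signal, the kernel and the helper `shift_right` are illustrative assumptions; the signal is zero near its borders so that the shift does not discard any information.

```python
import numpy as np

def shift_right(signal, k):
    """Translate a 1-D signal k positions to the right, padding with zeros."""
    if k == 0:
        return signal.copy()
    return np.concatenate([np.zeros(k), signal[:-k]])

# A small pattern surrounded by zeros, and a difference kernel.
signal = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -1.0])

response = np.convolve(signal, kernel, mode="full")
shifted_response = np.convolve(shift_right(signal, 3), kernel, mode="full")

# Shifting the input by 3 positions shifts the response by 3 positions.
is_equivariant = bool(np.allclose(shifted_response, shift_right(response, 3)))
```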

2.4.2 Pooling Layers

Following convolutional layers we usually find pooling layers; such layers apply a pooling operation over a window of fixed size across the convolutional layer response, producing in output a single value for each input region. There are different kinds of pooling operations, but the most common one is the max operator, where the maximum neuron response in the window is produced as output. These layers serve two purposes: they improve noise robustness and increase the size of the receptive field in deeper layers without increasing the size of the filters.


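[Figure 2.6: a 4 by 4 input with rows (a b c d), (e f g h), (i j k l), (m n o p), a 2 by 2 pooling window, and the resulting 3 by 3 output with rows (max(a, b, e, f), max(b, c, f, g), max(c, d, g, h)), (max(e, f, i, j), max(f, g, j, k), max(g, h, k, l)), (max(i, j, m, n), max(j, k, n, o), max(k, l, o, p)).]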

Figure 2.6 Max pooling operation with a 2 by 2 pooling window
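The operation in Figure 2.6 can be sketched in a few lines. The stride of 1 is inferred from the figure, where a 4 by 4 input yields nine overlapping windows; the function name `max_pool_2d` and the numeric input are illustrative assumptions.

```python
def max_pool_2d(grid, window=2, stride=1):
    """Max pooling over a 2D list of numbers, without padding."""
    rows = len(grid) - window + 1      # number of valid vertical window positions
    cols = len(grid[0]) - window + 1   # number of valid horizontal window positions
    return [[max(grid[r + dr][c + dc]
                 for dr in range(window) for dc in range(window))
             for c in range(0, cols, stride)]
            for r in range(0, rows, stride)]


grid = [[1, 3, 2, 0],
        [4, 6, 5, 1],
        [7, 2, 9, 3],
        [0, 8, 4, 5]]
pooled = max_pool_2d(grid)
print(pooled)
```

With `stride=2` the same function produces non-overlapping windows and a 2 by 2 output for this input.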

2.4.3 Network Architectures

The basic and most widely used Convolutional Neural Network architecture is composed of convolutional and pooling layers, whose benefits were introduced in the previous sections, followed by one or more fully connected layers that compute the desired predictions (Figure 2.7).

[Figure 2.7: layer labels Max-Pool, Convolution, Max-Pool, Dense; feature map sizes 8@128x128, 8@64x64, 24@48x48, 24@16x16; dense layer sizes 1x256 and 1x128.]

Figure 2.7 Convolutional Neural Network architecture diagram, each plane is a feature map

This architecture was popularized by [31], with its success as a handwritten digit classification tool. Since then, the main focus in the field has been on building deeper and deeper networks, with great advancements being achieved thanks to the ImageNet Large Scale Visual Recognition Challenge [50]. Because of the large size and complexity of the dataset itself, deeper models have been gaining popularity over the last years, the most notable being "AlexNet" [29], "GoogLeNet" [55], "VGG Net" [51], and more recently "ResNet" [19] and "DenseNet" [20], with the latest ones having hundreds of layers.


2.5 Transfer Learning for Deep Learning

As defined in [42]:

Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T.

In the literature many applications of transfer learning techniques can be found [42], since knowledge transfer can be implemented in various ways, depending on the task, domains, and models. In the context of deep learning, applying transfer learning usually refers to reusing the weights of an existing network, trained on a source dataset for a given task, as initialization of a different model for a different task or a different domain. This technique is particularly useful in deep learning, since training a network from scratch with limited data, poor initialization, and a lack of regularization to control the model capacity may very easily lead, after slow convergence, to a sub-optimal minimum, thus failing to generalize well [39]. Reusing the weights of another network as initialization for a Convolutional Neural Network can boost performance even when only small target training sets are available. In particular, in [58, 59] it was observed that the first layers of a network tend to learn simpler features that generalize more easily to different tasks than those of deeper layers; for networks trained on similar tasks and datasets, freezing the first few layers of the source network and training the remaining layers at a low learning rate proves to be a good strategy. This technique, called fine-tuning, is known to be a good solution also for the problem treated in our thesis, as can be seen in different publications [60, 56]; later on, in Chapter 4, we will explain how we used and refined this technique to achieve similar or better results, with further improvements in computational efficiency.

2.6 Image Distortion

The availability of large high-quality image datasets [7, 11] has been crucial for successfully training very deep and complex networks. In practical applications, though, it is common for images to be affected by different kinds of distortions, usually in the form of blur or noise. Blur can be caused by camera shake or lack of camera focus, and it can also affect pictures taken with high-quality cameras; noise, on the other hand, is usually due to bad lighting conditions or high sensor temperatures [60]. Distortions like these may prevent many deep learning models from performing as well as they do on clean images: [56] shows how state-of-the-art models trained on high-quality image datasets make unreliable, albeit low-confidence, predictions when they encounter blur in their inputs, due to their inability to generalize from their sharp training sets; [8] instead demonstrates that both noise and blur cause significant differences in the outputs of convolutional layers, when comparing clean and distorted images.
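As an illustration of the two distortion families discussed above, the sketch below applies additive white Gaussian noise and a Gaussian blur to a single row of pixel intensities. It assumes intensities in the range [0, 255], a severity parameter equal to the standard deviation of the noise or of the blur kernel, and edge replication at the borders; these choices and the helper names are assumptions made for illustration, not the exact settings used in this thesis.

```python
import math
import random


def add_awgn(pixels, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`, then clip to [0, 255]."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(pixels, sigma, radius=3):
    """Blur one image row, replicating the edge pixels at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(pixels)
    return [sum(kernel[k] * pixels[min(n - 1, max(0, i + k - radius))]
                for k in range(len(kernel)))
            for i in range(n)]


rng = random.Random(0)                       # fixed seed for reproducibility
row = [0.0, 0.0, 0.0, 255.0, 255.0, 255.0]   # a sharp edge
noisy_row = add_awgn(row, sigma=15.0, rng=rng)
blurred_row = blur_row(row, sigma=1.25)
print(noisy_row)
print(blurred_row)
```

A two-dimensional Gaussian blur can be obtained by applying `blur_row` first to every row and then to every column, since the Gaussian kernel is separable.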


Chapter 3

Related Work

In the literature we can find a multitude of approaches that address the covariate shift caused by image distortions in image datasets. In the first section we list articles studying the impact of image distortions on computer vision tasks, which usually also explore some possible mitigating solutions. In the second section, instead, we describe some more complex solutions, specifically designed to solve the image distortion issue in image classification.

3.1 Impact of Image Distortion on Computer Vision Tasks

Starting from the assumption that state-of-the-art Convolutional Neural Networks are commonly trained, and evaluated, on large annotated datasets of artifact-free high-quality images, [56] investigates the effect of optical blur artifacts on network performance. The authors tested different blur types, which can derive from real-world situations such as camera defocus, linear motion of the subject and camera shake. Their work shows that models trained only on high-quality images suffer a significant degradation in performance when applied to samples degraded by any of these types of blur. Interestingly, they show that there is a fair amount of generalization across different types and amounts of blur, so fine-tuning the network with a specific type and intensity of blur may also be beneficial for other types of blur. In general, however, they found that fine-tuning a pre-trained model with blurred images added to the training set allows it to regain much of the lost accuracy; this robustness derives from the models learning to generate blur-invariant representations in their hidden layers.

[8] determined how different kinds and intensities of image distortion affect the performance of CNNs on image classification. They found all the networks they tested to be susceptible to both blur and noise, but they observed that the performance of deeper networks falls off more slowly than that of shallower networks. They concluded that deeper structures give the network more room to learn features that are not affected by noise. They also observed that blur does not significantly affect early filter responses, but in spite of this the last layer activations exhibit significant changes, so even slight differences in early layers propagate to deeper layers. Noise, on the other hand, causes many activations in the first layer due to its high-frequency nature, and this translates into significant changes in the last layer responses.

3.2 Methods for Robust Image Classification

[60] performed an analysis of the effects of distortions on image classification, like the other works presented in the previous section, but they also proposed two solutions to the problem: fine-tuning and retraining. In fine-tuning they start from a pre-trained model and continue training the first N layers with distorted samples while keeping the rest of the network fixed; when performing retraining, instead, they train the entire network on the distorted dataset starting from random weights. They observed that both techniques reduce the classification error rate on distorted samples, but that the improvement in performance depends on the network and on the size of the training dataset: if the number of trainable parameters is large, fine-tuning is a better alternative than retraining, since it is less prone to overfitting on small datasets. They also show that both fine-tuning and retraining tend to "adjust" the image representation, making it similar to the representation of the undistorted image produced by the pre-trained model. To prove this, they show that, for blurred images, fine-tuning and retraining both increase the variance of the gradient of the activations, showing that such activations actually contain more information than the ones produced by the original model on the distorted images.

Following the work of [60], [3] proposed an alternative solution to improve the performance of pre-trained CNNs on distorted images. The authors observed that, for each layer of a CNN, certain filters are far more susceptible to input distortions than others, and that correcting the activations of these filters can help recover the lost performance. Starting from this observation, they rank the convolutional filters in order of the gain in classification accuracy obtained upon correction; then, they correct the activations of the top-ranked filters by appending small blocks of stacked convolutional layers at their outputs, and training them while keeping the rest of the network fixed. By doing so, they are able to significantly improve the robustness of the network against image distortions, while reducing the number of trainable parameters and achieving faster convergence in training with respect to fine-tuning entire layers. That said, the number of parameters to train with this technique remains considerably high with respect to our approach; moreover, the ranking of the filters is applicable only when the clean and noisy versions of the same image are available, and this may not always be a viable option in real-world problems.


In [9] the authors study a deep neural network ensemble method for the classification of images with quality distortions. Their work starts from the idea that the performance of a deep network on poor-quality images can be greatly improved if the network is fine-tuned with distorted data. They found that the performance improvement of a network fine-tuned on a single distortion usually does not generalize well across multiple distortion types. To mitigate this issue they propose a mixture-of-experts ensemble model, the MixQualNet, that is robust to multiple different types of distortions. This model is composed of "experts", each trained on a particular type of distortion, whose outputs are then weighted and summed to produce a final prediction. An independent gating network is in charge of weighting the expert outputs; given a particular distortion type and level, the gating network is trained to predict a weight for each expert's prediction. A very important property of this model is its ability to perform well even while being blind to the distortion level and type, since in many practical settings the nature of the image distortion is unknown or very hard to model. However, a clear downside of this solution is its complexity. To alleviate this problem, the authors introduced weight sharing into the MixQualNet, in the form of the Inverted TreeNet architecture. This modification to the architecture not only reduced the number of parameters required to train the model but also improved classification accuracy.
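The weighted combination of expert outputs described above can be sketched as follows. This is only a schematic illustration of a gated mixture, not the implementation of [9]: the use of a softmax to normalize the gate scores, the function names and the toy numbers are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_prediction(expert_outputs, gate_scores):
    """Weighted sum of per-expert class probabilities, with weights from the gate."""
    weights = softmax(gate_scores)
    n_classes = len(expert_outputs[0])
    return [sum(w * out[c] for w, out in zip(weights, expert_outputs))
            for c in range(n_classes)]


# Two hypothetical experts (e.g. one per distortion type) over three classes.
expert_outputs = [[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]]
gate_scores = [0.0, 0.0]   # equal scores, so both experts get weight 0.5
prediction = mixture_prediction(expert_outputs, gate_scores)
print(prediction)
```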


Chapter 4

Methodology

Having reviewed the main concepts and studies that constitute the foundations of our experiments, we now move on to a concrete explanation of our methodology, describing in detail the solution that we propose to deal with the problem defined in Chapter 1.

4.1 Model Training on Source Dataset

As we previously discussed, our work focuses on reducing the negative effects of the covariate shift caused by a generic image distortion function g(·). Starting from a DNN model trained on a large set of pristine images, we apply a suitable transfer learning technique to obtain a model that is robust to this shift, that is, a model whose latent features are robust to the input data distortion g(·).

In order to come up with a solution for this problem, we first need a model to use as a starting point. Let M be a convolutional neural network (CNN), trained on a source dataset D_S of undistorted images, that carries out the task of image classification (i.e., learning a mapping from an input image x_i^S to its corresponding label y_i^S). To avoid potential misunderstandings later on, we will use the term "pre-trained" or "baseline" to refer to any network that has been trained on undistorted images, throughout the rest of this thesis.

Since this model M has only been trained to classify images in a "clean" data scenario, it will not perform as well when tested on "noisy" data, as extensively shown in [8]. In fact, testing distorted images with a pre-trained DNN model, we observe that adding even a small amount of distortion to the original image can result in a misclassification, even though the added distortion does not hinder the human ability to classify the same images. In cases where the predicted label for a distorted image is correct, the prediction confidence drops significantly as the distortion severity increases. Examples of this phenomenon can be observed in Figure 4.1. As we can see, regardless of the type of distortion applied, the more the distortion severity increases, the more the confidence in the correct prediction is reduced, sometimes also leading to misclassification.

[Figure 4.1: predicted label and confidence for three test images under increasing distortion.
Row 1: Original: baby 12.929; blur = 0.25: baby 12.925; blur = 1.25: rocket 8.163; blur = 2.25: apple 8.379; AWGN = 5: baby 12.455; AWGN = 15: porcupine 8.335; AWGN = 25: keyboard 7.137.
Row 2: Original: beaver 12.779; blur = 0.25: beaver 12.776; blur = 1.25: baby 6.751; blur = 2.25: cup 6.856; AWGN = 5: beaver 11.625; AWGN = 15: bee 7.360; AWGN = 25: forest 7.697.
Row 3: Original: bottle 11.614; blur = 0.25: bottle 11.609; blur = 1.25: couch 7.404; blur = 2.25: sea 11.340; AWGN = 5: bottle 10.557; AWGN = 15: tractor 8.874; AWGN = 25: motorcycle 8.535.]

Figure 4.1 Effect of image quality on DNN predictions, with predicted label and confidence generated by a pre-trained All-Conv Net [52] model. Distortion severity increases from left to right, with the left-most image in a row having no distortion (original). Green text indicates correct classification, while red denotes misclassification. Left: Examples from the CIFAR-100 test set distorted by Gaussian blur. Right: Examples from the CIFAR-100 test set distorted by Additive White Gaussian Noise (AWGN).

A concise way to observe the impact of image quality on DNN classification performance is to visualize the distribution of classes in the feature space learnt by the DNN model. Since DNNs learn high-dimensional feature spaces that are difficult to visualize, we use the visualization method proposed in [35] to embed the high-dimensional features from the 3600-dimensional last convolutional layer of our All-Convolutional Net [52] implementation (details on this network will be provided in Chapter 5) into a 2-dimensional feature space, for 10 image classes from the CIFAR-100 dataset.

Figure 4.2 shows the effect of image quality on the discriminative power of the features learnt by this model trained on high-quality images, through t-SNE embeddings of object classes affected by different types and levels of distortion. One can see that the features of the pre-trained model are discriminative enough to generate a compact clustering of high-quality images from the same class and also provide a good separation between clusters of different classes. However, the addition of distortion not only reduces the separation between clusters but also causes the clusters to become wider, eventually resulting in most clusters merging into each other and thus becoming inseparable as the distortion severity increases. This indicates that features learnt from a dataset of high-quality images are not invariant to image distortion or noise and cannot be directly used for applications where the quality of the images differs from that of the training images.

[Figure 4.2: two rows of four panels. Top row: Original, Blur = 0.25, Blur = 1.25, Blur = 2.25. Bottom row: Original, AWGN = 5, AWGN = 15, AWGN = 25.]

Figure 4.2 Two-dimensional t-SNE embedding of the last layer features of the All-Conv net, visualized for original (undistorted), blurred and noise-affected images of 10 classes from the CIFAR-100 test set, with each color representing a separate class and distortion severity increasing from left to right. Each point in the embedding represents an image in the 10-class subset, with 100 images per class. Top row: Embedding for Gaussian blur affected images. Bottom row: Embedding for AWGN affected images.

Given these results, some questions immediately arise: for a network trained on undistorted images, are all convolutional filters in the network equally susceptible to noise or blur in the input image? Is it possible to identify and rank the convolutional filters that are most susceptible to image distortions, and to recover the lost performance by correcting only the values of the top-ranked filters? The upcoming sections are devoted to answering these questions.

4.2 Noise Impact Analysis

Since our study addresses the problem of image classification, we focused our attention on the portion of the network that is responsible for learning the latent representation of every input image. In this regard, our study addresses the problem of finding a quantitative measure of the change introduced by noise, with respect to the clean setting, at the level of the convolutional layers. Since the convolutional layer is the core building block for learning the latent representation of every input image, the comparison is performed at the level of the 2-dimensional activation maps, as we want to evaluate how noise impacts the learned latent representation of every image.


[Figure 4.3: panel (a) Associative pairs and panel (b) Non Associative pairs, each showing one Sharp/AWGN image pair and one Sharp/Blur image pair.]

Figure 4.3 Examples of associative and non associative pairs. Sharp and distorted pairs refer to the same images in the associative case. This does not hold for the non associative case, as the images in the pair are totally uncorrelated.

To compute this difference, a (clean, noisy) image pair is necessary to perform the comparison. For practical reasons, it is crucial to distinguish between two cases, depending on the setting in which we want to take advantage of the proposed technique. We will refer to associative pairs whenever the (clean, noisy) image pair concerns the same image. In this scenario, adopting the same notation as in Chapter 1, with x_i^C ∈ D_C being the clean image and x_i^T ∈ D_T the noisy one, we have x_i^T = g(x_i^C), where g(·) is the input distortion function. We will instead refer to non associative pairs whenever the (clean, noisy) image pair does not refer to the same image (x_i^T ≠ g(x_i^C)), to accommodate circumstances in which the clean and noisy versions of the same image are not available. Examples of images that display this distinction can be observed in Figure 4.3.
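The distinction between the two kinds of pairs can be made concrete with a small sketch. It is an illustration only: images are represented as plain lists of numbers, `distort` is a stand-in for g(·), and the cyclic-shift construction of non associative pairs is an assumption, since the procedure actually used is detailed in Section 4.4.

```python
import random


def associative_pairs(clean_images, distort):
    """Pair every clean image with the distorted version of the same image."""
    return [(x, distort(x)) for x in clean_images]


def non_associative_pairs(clean_images, distort, rng):
    """Pair every clean image with the distorted version of a different image."""
    n = len(clean_images)
    if n < 2:
        raise ValueError("at least two images are needed")
    offset = rng.randrange(1, n)   # non-zero cyclic shift: no image meets its own distortion
    return [(clean_images[i], distort(clean_images[(i + offset) % n]))
            for i in range(n)]


def distort(image):
    """Toy stand-in for g(.): add a constant offset to every pixel."""
    return [p + 0.5 for p in image]


clean_images = [[float(i)] * 4 for i in range(5)]   # five distinct toy images
assoc = associative_pairs(clean_images, distort)
non_assoc = non_associative_pairs(clean_images, distort, random.Random(0))
print(assoc[0], non_assoc[0])
```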

Since the non associative case is a slight variation of the associative one, we present the methodology assuming an associative setting. Details on the non associative scenario will be given in Section 4.4.

4.2.1 Measuring Filters’ Susceptibility to Input Data Distortion

The fundamental objective of the noise impact analysis is to determine which feature maps, in each convolutional layer, are most affected by input data distortion. Simply put, the goal is to measure which feature maps vary the most between their clean and noisy versions. This variation is a measure of how sensitive each convolutional filter, which generated the corresponding feature map activation, is to input data distortion, indicating how much each filter is responsible for the performance drop of the baseline model when tested on noisy inputs.

[Figure 4.4: clean samples and distorted samples are fed into the same pre-trained CNN (network trained on clean samples), producing clean activations and distorted activations for the filters 0, ..., F; these are compared with the EMD to produce D_i = [d_0, ..., d_F].]

Figure 4.4 High level diagram of the associative method. The clean and noisy versions of the same image are fed into the same baseline model, producing two different sets of feature map activations, which are then compared using the EMD metric to generate an array of distances, where each element represents the scalar distance between the clean and noisy activations of a single feature map in the investigated layer.

To measure this variation between feature map activations we adopted, because of its intuitive interpretation, the Earth mover's distance (EMD), which is also known in mathematics as the Wasserstein metric. The metric was first introduced by Rubner et al. [49] for applications to image databases, in order to provide a consistent measure of distance, or dissimilarity, between two distributions of points in a space for which a ground distance is given. An image yields a distribution in color space by mapping each pixel of the image to its color. Consequently, the EMD proved to be an appropriate metric for image retrieval in image databases. Along these lines, we decided to adopt this metric as our measure of variation of convolutional filter activations.

Starting from the baseline model trained on pristine images, discussed in Section 4.1, and from all of the N (clean, noisy) image pairs, where N represents the size of the training dataset, we are able to assess the filters' susceptibility to input data distortion using the Earth mover's distance just defined. Figure 4.4 summarizes our methodology on a single (clean, noisy) image pair. Given a convolutional layer l, we can extract, from the same baseline model M, two sets of activation maps generated by the convolutional filters present in layer l: one when the clean image from the pair is fed into the network M, and one when the noisy image is provided as input to the same baseline model. We now have two sets of 2-dimensional activation maps, each of cardinality F, where F equals the number of convolutional filters in layer l (see Clean activations and Distorted activations in Figure 4.4 as a reference for the two sets). It is important to note that these activations are collected at the output of the convolutional filters, after the nonlinear activation function has been applied element-wise to the 2-dimensional activation maps. If we flatten each of the computed activations into one dimension, the two sets of now 1-dimensional maps are ready to be compared with the EMD metric, since they represent 1D distributions.
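The per-filter comparison for a single image pair can be sketched as below. The sketch treats the flattened activation values of each map as an equal-weight sample, for which the 1D EMD with ground distance |a - b| reduces to the mean absolute difference of the sorted values. Whether this matches the exact way the flattened maps are turned into distributions in our implementation is an assumption, and the helper names are hypothetical.

```python
def emd_1d(u, v):
    """EMD between two equal-size samples with uniform weights and ground distance |a - b|."""
    if len(u) != len(v):
        raise ValueError("both samples must have the same number of values")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def flatten(activation_map):
    """Turn a 2D activation map (list of rows) into a flat list of values."""
    return [value for row in activation_map for value in row]


def filter_distances(clean_maps, noisy_maps):
    """One EMD value per filter, comparing its clean and noisy activation maps."""
    return [emd_1d(flatten(c), flatten(n)) for c, n in zip(clean_maps, noisy_maps)]


# Toy layer with F = 2 filters and 2x2 activation maps.
clean_maps = [[[0.0, 1.0], [2.0, 3.0]],
              [[1.0, 1.0], [0.0, 0.0]]]
noisy_maps = [[[0.0, 1.0], [2.0, 3.0]],    # filter 0 is unchanged by the distortion
              [[2.0, 2.0], [1.0, 1.0]]]    # filter 1 shifts by one unit everywhere
distances = filter_distances(clean_maps, noisy_maps)
print(distances)
```

Under this interpretation the distance depends only on the distribution of activation values, not on where in the map they occur, which is a direct consequence of flattening the maps before the comparison.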

Going back to the intuitive explanation of the Earth mover's distance: given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. We can always assume that there is at least as much earth as needed to fill all the holes to capacity, by switching what we call earth and what we call holes if necessary. Then, the EMD measures the least amount of work needed to fill the holes with earth. If we look at the "noise" in the activations as the "dirt" or "earth" to be rearranged, we can see how the intuition embodied by this metric fits our needs and expectations.

Besides this intuitive reasoning, there are several other arguments that support our decision. If we look, for example, at the advantages of the Wasserstein metric (i.e., the Earth mover's distance) compared to the popular Kullback–Leibler divergence [30], the most obvious one is that W is a metric whereas the KL divergence is not: KL is not symmetric, which would pose the problem of deciding which distribution to use as the reference when comparing the two. As for practical differences, one of the most important is that, unlike KL (and many other measures), Wasserstein takes into account the metric space. What this means in less abstract terms is perhaps best explained by the example in Figure 4.5: the KL divergence between the red and blue distributions is the same in both panels, whereas the Wasserstein distance measures the work required to transport the probability mass from the red state to the blue state using the x-axis as the "road". This measure is obviously larger the further away the probability mass is, hence the alias Earth mover's distance [5].
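The behaviour described above can be reproduced numerically. In the sketch below the three histograms on x = 1, ..., 5 are assumptions chosen so that the resulting values coincide with those printed in Figure 4.5; the exact distributions plotted in the figure are not given in the text. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions with the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def wasserstein_on_grid(p, q, spacing=1.0):
    """1D Wasserstein distance between histograms on an evenly spaced grid.

    It accumulates the absolute difference of the two cumulative
    distributions at every boundary between neighbouring bins.
    """
    distance, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cdf_gap += pi - qi
        distance += abs(cdf_gap) * spacing
    return distance


red = [0.6, 0.1, 0.1, 0.1, 0.1]         # peak at x = 1
blue_far = [0.1, 0.1, 0.1, 0.1, 0.6]    # peak moved to x = 5
blue_near = [0.1, 0.6, 0.1, 0.1, 0.1]   # peak moved to x = 2

for blue in (blue_far, blue_near):
    print(round(kl_divergence(red, blue), 4), round(wasserstein_on_grid(red, blue), 4))
```

In both cases half of the probability mass has to move, so the KL divergence is identical, while the Wasserstein distance grows with how far that mass travels along the x-axis.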

As for computing the EMD, the metric is based on a solution to the classical transportation problem [6]. This is a bipartite network flow problem that can be formalized as a linear programming problem, which means that it can be solved using any algorithm for the minimum cost flow problem.
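For reference, the transportation problem behind the EMD is commonly written as the following linear program. The notation is introduced here for illustration and is not used elsewhere in this chapter: P and Q are the two distributions, with weights w_{p_i} (i = 1, ..., m) and w_{q_j} (j = 1, ..., n), d_{ij} is the ground distance between element i of P and element j of Q, and f_{ij} is the amount of mass moved from i to j.

```latex
\begin{aligned}
\min_{f_{ij} \ge 0} \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \qquad i = 1, \dots, m, \\
& \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \qquad j = 1, \dots, n, \\
& \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\Big( \sum_{i=1}^{m} w_{p_i},\; \sum_{j=1}^{n} w_{q_j} \Big),
\end{aligned}
\qquad
\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f^{*}_{ij}\, d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f^{*}_{ij}}
```

Here f^{*} denotes an optimal flow. When both distributions carry the same total mass, the normalization in the denominator is a constant and the EMD is simply the minimum transport cost per unit of mass.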

Once this computation has been performed, as can be seen in Figure 4.4, we have a list of scalar values, of length F, where each element represents how dissimilar the clean and noisy versions of a given activation map are.


[Figure 4.5: two panels, each plotting probability (0.0 to 0.6) against x (1 to 5) for a red and a blue distribution. One panel reports Wasserstein distance 2.0 and KL divergence 0.8959; the other reports Wasserstein distance 0.5 and KL divergence 0.8959.]

Figure 4.5 Advantages of Wasserstein metric compared to Kullback–Leibler divergence.

In a purely iterative fashion, repeating the comparison for each (x_i^C, x_i^T) pair of training images, we can now compute, for each convolutional layer l, the EMD metric between the feature map activations generated when the clean version of the image is fed into the baseline model M and the feature map activations generated when the noisy version of the same image is fed into the network. The result of this computation is a rectangular distance matrix D, of shape N × F, where N represents the size of the training dataset (N = |D_C| = |D_T|), while F is the number of feature map activations present in the examined convolutional layer l. For each convolutional layer l in the network there will be a matrix of this kind, which represents, for each image at index i ∈ {1, ..., N}, how sensitive each convolutional filter, at index j ∈ {1, ..., F}, is to the applied image distortion.
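Written compactly, and using notation that is introduced here only for illustration, let a_{l,j}(x) denote the activation map of filter j in layer l (after the nonlinearity) when image x is fed into M, and let vec(·) denote flattening into one dimension. Each entry of the distance matrix of layer l can then be expressed as:

```latex
D^{(l)}_{i,j} \;=\; \mathrm{EMD}\!\left( \operatorname{vec}\!\big(a_{l,j}(x_i^{C})\big),\; \operatorname{vec}\!\big(a_{l,j}(x_i^{T})\big) \right),
\qquad i \in \{1, \dots, N\}, \quad j \in \{1, \dots, F\}.
```

Row i of D^{(l)} is therefore the array of distances produced for the i-th image pair in Figure 4.4.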

4.2.2 Ranking Filters by Susceptibility to Input Data Distortion

The remaining aspect of the noise impact analysis consists in aggregating all the computed distances: now that, for each layer of the network, we have N different opinions on which feature maps are most affected by noise, how do we aggregate them? Associating a large EMD value with a greater change due to noise, the objective is to produce a final ranking that sorts all the feature map indices in the examined convolutional layer l, from the most changing feature map to the least changing one.


Algorithm 1 Ranking convolutional filters per layer with Borda count
Input: convolutional layer l, input image samples
for filter in l do
    Initialize borda_count of filter = 0
end for
for (clean_img, noisy_img) in samples do
    distance = compute_distance(l, (clean_img, noisy_img))    (distance between feature map activations in layer l)
    ranked_filters = argsort(distance), in descending order    (most changed filter first)
    Truncate ranked_filters at 10 items
    Initialize idx = 0
    while idx < 10 do
        candidate_filter = ranked_filters at position idx
        points = 10 - idx
        Increase borda_count of candidate_filter by points
        Increase idx by 1
    end while
end for

In a democracy, how do citizens determine who is going to represent them in the nation's government? Through voting! That is precisely what we did. As a matter of fact, to accomplish the task of ranking the convolutional filters that are most susceptible to image distortions, we relied on an election method known as the Borda count [34]. This voting technique is a family of single-winner election methods in which voters rank options or candidates in order of preference. It was developed independently several times, as early as 1435 by Nicholas of Cusa [10], but is named after the 18th-century French mathematician and naval engineer Jean-Charles de Borda, who devised the system in 1770 [34]. The voting system works like this: voters (in our case, the N training image pairs) rank the list of candidates (the F feature maps in layer l) in order of preference. Points are given to each candidate according to their position in the ranking, so that higher-ranked candidates receive more points. When all votes have been counted, and the points added up, the candidate with the most points wins. In our setting, candidates receive 10 points each time they are ranked first, 9 for being ranked second, and so on, with the 10th candidate receiving just 1 point and the remaining ones 0. The pseudo-code of this methodology is given in Algorithm 1. At the end of this voting operation, each feature map in layer l has its own Borda count, which is simply a number of points: the higher this value, the more susceptible to input data distortion the corresponding convolutional filter is considered to be.
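The voting scheme described above can be sketched as follows. This is a minimal illustration, not our actual implementation: `distance_matrix` is assumed to hold one row of per-filter distances for each image pair, and the function name and toy values are hypothetical.

```python
def borda_ranking(distance_matrix, top_k=10):
    """Rank filters from most to least susceptible with a truncated Borda count."""
    n_filters = len(distance_matrix[0])
    borda_count = [0] * n_filters
    for distances in distance_matrix:          # one voter per (clean, noisy) image pair
        # Filter indices sorted by decreasing distance: most changed filter first.
        ranked = sorted(range(n_filters), key=lambda j: distances[j], reverse=True)
        for position, filter_idx in enumerate(ranked[:top_k]):
            borda_count[filter_idx] += top_k - position   # top_k points, then top_k - 1, ...
    # Final ranking: filter indices sorted by decreasing Borda count.
    ranking = sorted(range(n_filters), key=lambda j: borda_count[j], reverse=True)
    return ranking, borda_count


# Toy example: 3 image pairs, 4 filters, only the top 2 filters of each vote score.
distance_matrix = [[0.1, 0.9, 0.5, 0.2],
                   [0.3, 0.8, 0.7, 0.1],
                   [0.9, 0.4, 0.6, 0.2]]
ranking, borda_count = borda_ranking(distance_matrix, top_k=2)
print(ranking, borda_count)
```

The default `top_k=10` mirrors the 10-point scheme described in the text; filters with equal Borda counts keep their original index order, which is a tie-breaking choice made for this sketch only.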


This voting system was adopted in order to obtain a single ranking that sorts, by the value of their Borda counts, the convolutional filters in each convolutional layer l. Through a very simple argsort operation, which returns the indices that would sort the list of Borda counts in descending order, we obtain the desired ranking. Because this voting technique tends to elect broadly acceptable candidates, rather than those preferred by a majority (i.e., only the filter that varies the most, the majority of the time), Borda count proved to be the most successful technique for ranking the feature maps most affected by noise.

Repeating this process for each convolutional layer that we would like to analyze, we collect a set of rankings, each of which is a list of indices $j \in \{1, \dots, F\}$ that orders the convolutional filters by their susceptibility to input data distortion, which is exactly what we wanted to achieve through our noise impact analysis. Going back to the questions that we left open in Section 4.1, we have now shown that, for a network trained on undistorted images, not all convolutional filters are equally susceptible to noise in the input image and that, through our proposed technique, we are able to identify and rank the convolutional filters that are most susceptible to image distortions.

The upcoming section illustrates how this set of rankings is used to recover the performance that the baseline model loses when tested on noisy inputs, providing an answer to the last of the aforementioned questions.

4.3 Model Fine-Tuning on Target Dataset

Before presenting the last and fundamental step of our methodology, it is essential to introduce the idea that gave us the intuition behind our selective filter-level fine-tuning approach.

4.3.1 Activation Maps Swapping

Let us denote the output of a single convolutional filter $\phi_{i,j}$ to the input $x_i$ by $\phi_{i,j}(x_i)$, where $i$ and $j$ correspond to the layer number and filter number, respectively. If $g_i(\cdot)$ is a transformation that models the distortion acting on the filter input $x_i$, then the output of the convolutional filter $\phi_{i,j}$ to the distortion-affected input is given by $\tilde{\phi}_{i,j}(x_i) = \phi_{i,j}(g_i(x_i))$. Note that $\tilde{\phi}_{i,j}(x_i)$ represents the filter activations generated by distorted inputs, while $\phi_{i,j}(x_i)$ represents the filter activations for undistorted inputs. Assuming we have access to $\phi_{i,j}(x_i)$ for a given set of input images, replacing $\tilde{\phi}_{i,j}(x_i)$ with $\phi_{i,j}(x_i)$ in a deep network is akin to perfectly correcting the activations of the convolutional filter $\phi_{i,j}$ against input image distortions. Computing the output predictions by swapping a distortion-affected filter output with its corresponding clean output for each of the ranked filters would improve classification performance. Intuitively, the more susceptible a convolutional filter is to input image distortion, the more its output negatively impacts the classification performance of the model when tested on noisy data. Consequently, the idea is to swap the outputs of the convolutional filters in the order given by the previously computed ranking, calculated as described in Section 4.2. Doing so, we fix the output of the "harmful" convolutional filters, acting directly on the parts of the network that do not behave as expected. A high-level representation of this approach is shown in Figure 4.6.
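The swapping step for a single layer and a single (clean, noisy) image pair can be sketched as below. The function name `swap_feature_maps`, the `n_swap` argument and the (F, H, W) array layout are assumptions made for this illustration; the thesis does not show its own implementation at this point.

```python
import numpy as np

def swap_feature_maps(noisy_maps, clean_maps, ranking, n_swap):
    """Replace the n_swap most susceptible noisy feature maps with clean ones.

    noisy_maps, clean_maps: arrays of shape (F, H, W), the activations of one
    convolutional layer for the distorted and the undistorted image.
    ranking: filter indices ordered from most to least susceptible.
    Returns a corrected copy; the inputs are left untouched.
    """
    corrected = np.array(noisy_maps, copy=True)
    idx = np.asarray(ranking[:n_swap], dtype=int)
    # Overwrite only the selected channels with their clean counterparts.
    corrected[idx] = np.asarray(clean_maps)[idx]
    return corrected

# Toy usage: 4 feature maps of size 2x2, swap the two most susceptible ones.
clean = np.zeros((4, 2, 2))
noisy = np.ones((4, 2, 2))
fixed = swap_feature_maps(noisy, clean, ranking=[2, 0, 3, 1], n_swap=2)
```

The corrected activations would then be passed to the remaining layers of the network in place of the noisy ones.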

An interesting observation that we stumbled upon while applying this technique was that the swapping methodology turned out to be a potential regularization technique for the baseline model. Our experiments showed that, depending on the baseline model and the number of convolutional feature maps swapped, the classification performance of the model on noisy data could exceed the performance of the baseline model on clean data. If, for example, we swapped about 80% of the convolutional feature maps of one of the early layers of the network, we would see the classification accuracy on noisy data being better than the one computed on clean data. This result seemed quite counterintuitive but, after further analysis, it suggested that the baseline model, in cases where this phenomenon took place, was probably suffering from overfitting. The reason is that swapping the majority of the noisy feature maps with their clean counterparts amounts to a form of noise injection. As extensively discussed in [14], noise injection is an efficient technique to mitigate overfitting in neural networks. Just like dropout [53], a successful regularizer that injects noise into hidden units during training, this swapping technique turned out to behave like a member of the same family of regularizers. In fact, swapping so many feature maps in a single convolutional layer leads to a model in which 80% of the feature maps of that layer match the clean ones, while the remaining 20% are the noisy ones. These distorted feature maps act as noise injected into the outputs of the hidden units, just like dropout, improving the generalization performance of the model and mitigating overfitting.

Another interesting aspect of this swapping methodology is that no training is required to adopt it. Once the baseline model has been trained and the rankings of the most susceptible filters have been computed, no further fine-tuning needs to be performed on the model. All of the modifications required by the swapping are done during the testing phase. However, this technique suffers from one major disadvantage, which prevents the applicability of the methodology in a real-world scenario. Besides the question of how many feature maps to swap in the convolutional layer, which is still relevant but does not impact the applicability of the method, a (clean, noisy) pair is required for every testing image. While at training time this requirement, needed to compute the rankings of the filters most affected by noise, is still manageable, most of the time it cannot be met at testing time. If no clean version of the image is available, no swapping can take place, and the methodology cannot be used.

[Diagram labels: Clean image, Distorted image, Pre-trained CNN (same network trained on clean samples), Clean activations, Noisy activations, Noisy activations corrected, Fully-Connected Classifier; the feature maps generated by the filters most susceptible to input data distortion are swapped.]

Figure 4.6 High-level diagram representing the feature maps swapping technique. A (clean, noisy) pair of the same image is needed, along with the ranking of the convolutional filters, for each layer, that orders them by their distortion susceptibility. Once the feature maps have been computed for both images, we proceed by swapping the distorted feature maps with their undistorted counterparts, according to the order imposed by the ranking.

This major obstacle forced us to look for a different solution to our problem of improving the robustness of the baseline model to image distortion, and this is why we turned to selective filter-level fine-tuning, which we describe in the upcoming section.

4.3.2 Selective Filter-Level Fine-Tuning

As mentioned in the previous section, activation maps swapping is a promising technique, but its applicability is severely constrained by the requirement of a (clean, noisy) pair for every image in the testing set, which is unlikely to be satisfied in practice. For this reason, we had to come up with a different methodology that would still take advantage of the ranking that assesses the sensitivity of individual convolutional filters to image distortion (described in Section 4.2), but with the clear goal of overcoming the restriction imposed by the swapping procedure. We therefore turned to selective filter-level fine-tuning.

Parameter fine-tuning in convolutional neural networks is usually performed by taking a previously trained network and retraining some of its convolutional layers, in order to adapt them to the target task or domain. Since our objective is not to accomplish a new task, but rather to mitigate the impact of noise on the source model, we retrain only the filters that produced the activation maps that changed the most, instead of all the filters in the layer.

In detail, we train a "partially trainable" neural network where, unlike the standard transfer learning approach of "freezing" at the layer level (i.e. keeping all the parameters of the layer constant), we freeze at the filter level. As a result, within the same convolutional layer, some filters are re-trainable, while others are not. To train the network, we use the target training dataset, in which all the samples have been subject to image distortion, enhancing the robustness of the baseline model to noise in the input data. The intuition behind this approach is that, by acting only on the filters that are most affected by noise (i.e. the most corrupted ones), we directly operate on the parts of the network that are most responsible for the performance drop when moving from a clean data setting to a noisy one, fixing their feature extraction capabilities.

Even though, unlike the swapping methodology, this procedure requires an additional training phase to fine-tune the selected convolutional filters, it overcomes the restriction of requiring a (clean, noisy) pair for every image in the testing set. Once the convolutional filters have been ranked in order of susceptibility to input distortion, which is the phase that requires a set of (clean, noisy) image pairs, we do not need any (clean, noisy) pair of testing images to put the model to use. Once the model has been fine-tuned, it is able to produce predictions regardless of the type of image (distorted or not) that is provided as input. In addition, even though this supplementary training phase is a potential downside of the technique, only a portion of the network parameters are set as "trainable", so the fine-tuning lasts only a small fraction of the time that was required to train the entire network.

4.4 Non Associative Method

In the previous section, we showed how the restrictive limitation imposed by the activation maps swapping methodology, which requires a (clean, noisy) pair for each image in the testing set, can be overcome by the selective filter-level fine-tuning technique. Even though this technique gets over that impediment, it still requires, in order to be applicable, a ranking of the most noise-susceptible convolutional filters in the network. This ranking, as we explained in Section 4.2, needs a set of (clean, noisy) pairs of training images in order to be computed. Not having the (clean, noisy) input pair for every training image would prevent the applicability of the technique, as it would not be possible to compute the EMD between the activation maps of the clean and noisy image pairs.

Despite being a considerably less restrictive constraint than having this set of pairs for each testing image, in practical applications it might not always be possible to have these images available even at training time. As we first mentioned in Section 4.2, we define this setting as non associative. In this non associative setting, each noisy image $x^T_i$ is not generated by applying the distortion function $g(\cdot)$ to its corresponding clean version $x^C_i$. Instead, it is an independently collected noisy image whose corresponding clean version is not available. Formally, $x^T_i \neq g(x^C_i)$.

To provide a solution for every practical application of our transfer learning technique, we rely on the idea of finding images that serve as representatives for a given class of images, where the class labels, in this case, are not their actual categories. Instead, images are split based on whether they are noisy or not. The two representatives are then compared against each other to compute the Earth mover's distance, so that each convolutional filter can be ranked by its susceptibility to image distortion. With this goal in mind, we rely on the concept of cluster exemplars to find these representative images.

4.4.1 Finding Representative Images

Spatial clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are dissimilar to objects in other clusters [18]. Clustering methods can themselves be divided into two groups: those that automatically determine the number of clusters within the data, and those that do not, which require a hyperparameter specifying the number of clusters to group the data in. One of the clustering methods that belongs to this second category is k-medoids.

The k-medoids algorithm is a clustering algorithm that breaks the dataset up into groups, attempting to minimize the distance between the points assigned to a cluster and a point designated as the center of that cluster [25]. In contrast to the more popular k-means algorithm, k-medoids chooses actual data points as centers, also named medoids, defined as the members of the input set that are the most representative of the assigned cluster. When centers are selected from actual data points, they are also called exemplars [12]. This popular k-centers clustering technique works by initially selecting a random set of exemplars and iteratively refining this set so as to decrease the sum of squared errors.

We preferred k-medoids to other clustering techniques that find exemplars, like affinity propagation [12], because the k hyperparameter lets us fix the number of clusters to find. Since we are only interested in finding one exemplar per dataset, we fix k = 1, which is equivalent to finding the exemplar that lies in the middle of the dataset, in its multidimensional vector space.

To find these exemplars, we divide our training data into two separate datasets: one composed only of clean images ($D^C$) and the other composed of noisy images ($D^T$). In a totally unsupervised fashion, it is possible to apply this clustering technique to each of the two datasets, which are now independent, to find a representative (exemplar) image for each dataset: the exemplar for the clean dataset, $x^C_{ex} \in D^C$, and the exemplar for the noisy one, $x^T_{ex} \in D^T$.

Going back to the approach defined in Section 4.2.1, we can now compute the distance between the feature map activations, for any given convolutional layer l, of the exemplar image for the clean dataset, $x^C_{ex}$, and the exemplar image for the noisy dataset, $x^T_{ex}$. The result, differently from the associative scenario, is not a distance matrix anymore, as it is the outcome of a single (clean, noisy) pair and not of an entire set of images. It is a one-dimensional vector of scalar values, each representing the distance between the clean and noisy exemplar for the feature map activation at index $j \in \{1, \dots, F\}$ of the convolutional layer l.

4.4.2 Ranking Filters by Susceptibility to Input Data Distortion

At this point, just as in the associative scenario, once the distances between feature map activations have been computed, it is time to rank the convolutional filters by their susceptibility to image distortion. Since the computed distance vector is a one-dimensional array, there is no need to adopt any voting technique as in the associative approach. It is now possible to simply calculate the indices that would sort the vector in descending order (argsort function), so as to come up with the final ranking of the most affected feature map activations for the convolutional layer l.
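In the non associative case the ranking reduces to a single descending argsort, which can be sketched as below. The function name `rank_from_exemplar_distances` and the toy distance values are invented for the illustration.

```python
import numpy as np

def rank_from_exemplar_distances(distances):
    """Rank filters from a 1D vector of clean/noisy exemplar distances.

    distances: length-F sequence; distances[j] is the distance between the
    feature map j of the clean exemplar and that of the noisy exemplar.
    Returns filter indices ordered from most to least affected.
    """
    d = np.asarray(distances, dtype=float)
    # Negating the vector turns NumPy's ascending argsort into a descending one.
    return np.argsort(-d, kind="stable")

# Toy usage with made-up distances for a layer with 4 filters.
toy_ranking = rank_from_exemplar_distances([0.2, 1.5, 0.7, 0.1])
# The first entries of the ranking are the candidates for fine-tuning.
most_affected = toy_ranking[:2]
```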

Once the ranking has been estimated, the pipeline is identical to the associative case. The produced ranking is the building block for the subsequent and final step, which consists in fine-tuning only a subset of the filters of the convolutional layer, corresponding to the ones most affected by noise, as detailed in Section 4.3.2.


Chapter 5

Implementation

Now that the methodology behind our thesis has been described in detail, we outline how the system has actually been implemented. We structure this chapter following the same format as the previous chapter (Chapter 4), with the addition of an introductory section, in order to provide a clear explanation of how each sub-component of the methodology was developed.

5.1 Source Code and Deployment

The source code of the entire project was structured as a Python project, mostly because of the extensive availability of machine learning and deep learning libraries implemented in this programming language. Python is often the language of choice for developers who need to apply statistical techniques or data analysis in their work. It is also used by data scientists whose tasks need to be integrated with web apps or production environments. Its combination of consistent syntax, short development time and flexibility makes it well suited to developing sophisticated models and prediction engines, which is why Python really shines in the field of machine learning.

Before completing the entire pipeline of the structured Python project, the development of the prototype, along with its experiments and evaluation, was run on a cloud-based environment known as Google Colaboratory [16]. Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations and narrative text. Use cases include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more [27]. The advantage of Colab is that the notebook runs in the cloud, making it possible to write and execute code, save and share analyses, and access powerful computing resources remotely instead of locally, all for free and through a web-based interface.

Beyond these benefits, the main advantage of Google Colab over a local Jupyter installation proved to be the free access to GPU-based computation, which provided a substantial speedup when training our convolutional neural network models. Google Colab allows users to develop deep learning applications on the NVIDIA Tesla K80 GPU, entirely for free, which yields a considerable boost in performance over standard CPU computation.

Even though Google Colab was the go-to solution for our "trial and error" phase, Colab is intended for interactive use. Long-running background computations, particularly on GPUs, may be stopped by the system's provider [17]. Because of this, for the training and deployment of the final project, which included the fine-tuning experiments of all the configurations described in Chapter 6, we had to rely on a more stable solution that would not run into computational restrictions. For affordability and ease of use, we decided to rely on the Deep Learning VM provided by Google Cloud Platform [15]. This preconfigured virtual machine makes it easy and fast to instantiate a VM image containing the most popular deep learning and machine learning frameworks on a Google Compute Engine instance. This deployment removes the hassle of setting up a high-performance computing environment, with the latest NVIDIA GPU libraries and Intel libraries ready to go, along with the latest supported drivers. Our configuration included 2 vCPUs and an NVIDIA Tesla K80 GPU with 13GB of RAM, which proved to be enough for our computational requirements.

Now that we have described how our pipeline was deployed and executed, the following sections go into more detail on how we developed the individual sub-components of our technique, as introduced in the previous chapter.

5.2 Model Training on Source Dataset

All the general aspects of our pipeline were written in plain Python, but to implement the two Convolutional Neural Network models we relied on a deep learning framework. Because of its simplicity and ease of learning, we picked the deep learning framework developed by Facebook's artificial intelligence research group, known as PyTorch [43]. PyTorch is an open-source machine learning library inspired by Torch, which has been optimized to perform tensor calculations using GPUs and CPUs. One of the great advantages of PyTorch over other deep learning frameworks is that it builds deep learning applications on top of dynamic graphs, which can be manipulated at runtime. Other popular deep learning frameworks work on static graphs, where computational graphs have to be built beforehand, and the user cannot see what the GPU or CPU processing the graph is doing. In PyTorch, instead, every level of computation can be accessed and peeked at. Because the computational graph is defined at runtime, the code is much easier to debug, as all Python debugging tools (e.g. pdb) can be freely used on PyTorch code too.

Another important advantage of PyTorch, which greatly facilitated the transition to this deep learning framework, is the similarity between PyTorch's Tensor object and NumPy's array. NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays [40]. It is based on a powerful N-dimensional array object, which is conceptually identical to PyTorch's Tensor object [46]. NumPy proved to be a great library, and many of the functions we implemented are based on it, but NumPy cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so NumPy alone was not enough for our needs. PyTorch Tensors, which as we said are conceptually identical to NumPy arrays, can instead utilize GPUs to accelerate their numeric computations, and this is why PyTorch ended up being the most effective solution for our requirements.

With PyTorch, we defined the two network architectures, detailed in Chapter 6, of the models that we trained on undistorted images and that served as the baseline models for our fine-tuning approach. The definition of the two Python classes that implement the two networks can be found in Section A.1 of Appendix A.

5.3 Noise Impact Analysis

As described in Chapter 4, the noise impact analysis involves two steps. First, we must compute the distance between clean and noisy feature map activations per layer. We achieve this through the Earth mover's distance, which in mathematics is known as the Wasserstein metric, and it is under this name that we compute it. The SciPy library [23], a fundamental library for scientific computing in Python, provides a fast implementation of the first Wasserstein distance between two 1D distributions. The function's prototype is the following:


scipy.stats.wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None)

For our computations, we leave u_weights and v_weights, which represent the weight of each value, at None, so that each value is assigned the same weight. Listing 3 illustrates how this function is used to compute the distance between clean and noisy feature map activations, for each convolutional layer of interest.
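As a rough illustration of how the function can be applied per image pair and per filter, the sketch below builds the (N, F) distance matrix for one layer. It is not the thesis's Listing 3: the function name `layer_distance_matrix`, the (N, F, H, W) array layout and the choice of treating the flattened activation values of each feature map as a 1D empirical distribution are assumptions made here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def layer_distance_matrix(clean_acts, noisy_acts):
    """Distance between clean and noisy activations of one layer.

    clean_acts, noisy_acts: arrays of shape (N, F, H, W) holding the feature
    maps of N (clean, noisy) image pairs for a layer with F filters.
    Returns an (N, F) matrix of first Wasserstein distances.
    """
    clean_acts = np.asarray(clean_acts, dtype=float)
    noisy_acts = np.asarray(noisy_acts, dtype=float)
    n_pairs, n_filters = clean_acts.shape[:2]
    dist = np.empty((n_pairs, n_filters))
    for n in range(n_pairs):
        for j in range(n_filters):
            # u_weights and v_weights are left at None: uniform weights.
            dist[n, j] = wasserstein_distance(
                clean_acts[n, j].ravel(), noisy_acts[n, j].ravel()
            )
    return dist

# Toy usage: only filter 1 is shifted by the "distortion".
toy_clean = np.zeros((2, 3, 4, 4))
toy_noisy = toy_clean.copy()
toy_noisy[:, 1] += 2.0
toy_dist = layer_distance_matrix(toy_clean, toy_noisy)
```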

The distance matrix returned by this computation is essential for the second step of the noise impact analysis, which consists in ranking convolutional filters by their susceptibility to image distortion in the input data. Listing 4 shows how, through the NumPy library and Python's collections module, we implement the previously described Borda count election method. The output of this function is the desired ranking of the most susceptible convolutional filters of the convolutional layer of interest, which is the fundamental ingredient of our selective filter-level fine-tuning methodology.

5.4 Model Fine-Tuning on Target Dataset

Once the rankings of the convolutional filters that we want to fine-tune have been computed, we can perform the fine-tuning described in Section 4.3 with our target dataset of distorted images. The technologies adopted for this segment of the pipeline are the same as the ones mentioned in Section 5.2. The only difference is that this time we have to alter the network architectures so that the gradient is propagated to, and training is therefore enabled for, only some of the convolutional filters per layer. This feature is not directly provided by the PyTorch library, which only allows an entire parameter tensor (and therefore all the filters of a layer) to be marked as non-trainable. Because of this, we developed a workaround, which consists in splitting the set of convolutional filters in each convolutional layer that we want to fine-tune into two parts: one where the parameters are left trainable (i.e. the filters most susceptible to image distortion), and one where the parameters are fixed and kept the same as those in the pre-trained models, which we also call "frozen" parameters. Once the two groups of convolutional filters have completed their computations, the outputs (i.e. feature map activations) are merged into the same set of feature map activations that a single convolutional layer would have produced. To see how we achieved this, the reader can refer to the network definitions of the models that implement this selective filter-level fine-tuning in Section A.3 of Appendix A.
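One way the split-and-merge workaround could look in PyTorch is sketched below. It is not the code of Section A.3: the class name `PartiallyTrainableConv2d` and its constructor arguments are hypothetical, and the sketch assumes a standard `nn.Conv2d` with `groups=1` and at least one filter in each of the two groups.

```python
import torch
import torch.nn as nn

class PartiallyTrainableConv2d(nn.Module):
    """Wraps a pre-trained Conv2d, keeping only selected filters trainable."""

    def __init__(self, conv, trainable_idx):
        super().__init__()
        train_set = {int(i) for i in trainable_idx}
        self.train_idx = sorted(train_set)
        self.frozen_idx = [i for i in range(conv.out_channels) if i not in train_set]
        # Two smaller convolutions that share the input but own disjoint filters.
        self.trainable = self._subset(conv, self.train_idx)
        self.frozen = self._subset(conv, self.frozen_idx)
        for p in self.frozen.parameters():
            p.requires_grad = False  # frozen filters keep pre-trained weights
        # Inverse permutation that restores the original channel order.
        order = torch.tensor(self.train_idx + self.frozen_idx)
        self.register_buffer("restore", torch.argsort(order))

    @staticmethod
    def _subset(conv, idx):
        sub = nn.Conv2d(conv.in_channels, len(idx), conv.kernel_size,
                        stride=conv.stride, padding=conv.padding,
                        dilation=conv.dilation, bias=conv.bias is not None)
        with torch.no_grad():
            sub.weight.copy_(conv.weight[idx])
            if conv.bias is not None:
                sub.bias.copy_(conv.bias[idx])
        return sub

    def forward(self, x):
        # Merge the two partial outputs along the channel dimension.
        merged = torch.cat([self.trainable(x), self.frozen(x)], dim=1)
        return merged[:, self.restore]

# Toy usage: filters 1 and 4 of a 6-filter layer stay trainable.
pretrained = nn.Conv2d(3, 6, kernel_size=3, padding=1)
layer = PartiallyTrainableConv2d(pretrained, trainable_idx=[4, 1])
optimizer = torch.optim.SGD(
    [p for p in layer.parameters() if p.requires_grad], lr=0.01
)
```

Before any fine-tuning step the wrapped layer reproduces the output of the original one, so it can replace the original layer inside the pre-trained network without changing its predictions.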


5.5 Non Associative Method

The last essential component of our proposed methodology concerns the non associative approach. As specified in the previous chapter, this technique relies on the concept of cluster exemplars, which are defined as the members of the input set that are most representative of the assigned cluster.

To find a "clean" cluster exemplar and a "noisy" cluster exemplar, the first step consists in gathering a set of undistorted training images and a set of noisy ones, so that the two sets are perfectly disjoint: no image should appear as "clean" in one set and as "noisy" in the other. The two sets should not have anything in common.

Once the training data has been properly split, it is possible to apply the selected clustering methodology to each set individually to find the cluster exemplar that represents the input data. First, we must compute the pairwise distances between the available observations, in order to provide these distances as input to the clustering algorithm. To accomplish this, we rely on another popular machine learning library, Scikit-learn [44]. Scikit-learn is an open-source machine learning library for the Python programming language, which features various supervised and unsupervised learning algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Since we want to compute pairwise distances between training samples, we make use of Scikit-learn's pairwise_distances function, whose prototype is the following:

sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)

The function computes the distance matrix from a vector array X and an optional Y. In our case, Y is not provided as input to the function. When Y is None, the function returns a distance matrix D such that $D_{i,j}$ is the distance between the ith and jth vectors of the given matrix X. It is also possible to specify which metric to use when calculating the distance between instances in a feature array. We detail in Chapter 6 the metrics that we used for our analysis.

Now that the distance matrix has been computed, it is possible to move on to the clustering algorithm. In our case, as previously described, we decided to use the k-medoids clustering algorithm with k = 1. Our version of this clustering algorithm can be examined in Section A.4 of Appendix A.
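With k = 1 there is a single cluster, so the medoid is simply the sample with the smallest total distance to all the others, and no iterative refinement is needed. The sketch below illustrates this special case. It is not the implementation of Section A.4, and the function name `find_exemplar` and the flattening of each image into a vector are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def find_exemplar(images, metric="euclidean"):
    """Return the index of the 1-medoid (exemplar) of a set of images.

    images: array-like of shape (N, ...); each image is flattened to a vector.
    metric: any metric name accepted by sklearn's pairwise_distances.
    """
    X = np.asarray(images, dtype=float)
    X = X.reshape(len(X), -1)
    # D[i, j] is the distance between sample i and sample j.
    D = pairwise_distances(X, metric=metric)
    # The medoid minimizes the summed distance to every other sample.
    return int(np.argmin(D.sum(axis=1)))

# Toy usage on five 1-pixel "images": the middle of the bulk is the exemplar.
toy_images = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
toy_exemplar = find_exemplar(toy_images)
```

Running the function once on the clean set and once on the noisy set would give the two exemplar images. Squaring `D` before summing would give the squared-error variant of the objective.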

Once the data has been clustered, it is possible to extract the exemplar of the largest cluster from each run of the clustering algorithm, resulting in the availability of two images:


• Image that represents the exemplar of the clean dataset;

• Image that represents the exemplar of the noisy dataset.

With these two elements, following the steps indicated in Section 4.4, we are now able to obtain the list of the convolutional filters that are most susceptible to input data distortion, for each convolutional layer of interest.


Chapter 6

Experiments

In this chapter, we take a closer look at the experiments we performed to validate our assumptions and the performance of our methodology. We start with the experimental setting, in which we detail all the steps needed to verify our statements. We then present the results of our experiments in detail, paving the way for a final discussion on the results themselves and on the trade-offs that have to be considered in order to apply the proposed technique.

6.1 Experimental Setting

As mentioned in the introduction to this chapter, this section describes the datasets, image distortions, network architectures and clustering details used to validate the proposed transfer learning technique.

6.1.1 Datasets

We used two popular image classification datasets: CIFAR-10 and CIFAR-100 [28]. CIFAR-10 consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The training set contains exactly 5000 images from each class, while the test set contains exactly 1000 randomly-selected images from each class. The classes are completely mutually exclusive (e.g. there is no overlap between automobiles and trucks). The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. We split the training set of both CIFAR-10 and CIFAR-100 using an 80-20 ratio: 40000 images for training and 10000 for validation.


6.1.2 Distortions

We focus on evaluating two important and conflicting types of image distortion: Gaussian blur and Additive White Gaussian Noise (AWGN), each over 3 levels of distortion severity. Gaussian blur, often encountered during image acquisition and compression, represents a distortion that eliminates high-frequency discriminative object features like edges and contours, whereas AWGN is commonly used to model additive noise encountered during image acquisition and transmission.

Since we use datasets with the same input resolution, we use identical sets of distortion parameters for each dataset. For AWGN, we use a noise standard deviation σ_g ∈ {5, 15, 25} and mean µ_g = 0, clipping every pixel value to the [0, 255] range. For the Gaussian blur, we instead use a standard deviation σ_b ∈ {0.25, 1.25, 2.25}. For both CIFAR-10 and CIFAR-100, the size of the blur kernel is set to 4 times the blur standard deviation σ_b.
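The two distortions can be sketched with NumPy as follows. The function names are hypothetical, the kernel length is taken as 4σ_b rounded up to the nearest odd integer (an assumption about how the kernel size was discretized), and edge replication at the image borders is likewise an assumption.

```python
import numpy as np

def add_awgn(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the valid pixel range."""
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    return np.clip(image.astype(np.float64) + noise, 0.0, 255.0)

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel of length about 4 * sigma (forced odd)."""
    size = int(np.ceil(4 * sigma))
    size += 1 - size % 2  # assumption: use an odd length so the kernel has a centre
    x = np.arange(size) - size // 2
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur over the two spatial axes of an HxWxC image."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    out = image.astype(np.float64)
    for axis in (0, 1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        # Assumption: replicate border pixels before filtering.
        padded = np.pad(out, pad_width, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
noisy = add_awgn(image, sigma=15, rng=rng)
blurred = gaussian_blur(image, sigma=1.25)
```

Under these assumptions the three blur levels yield kernels of length 1, 5 and 9, so the mildest blur leaves the image unchanged.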

6.1.3 Network Architectures

Classifying CIFAR-100 images is more complex than classifying CIFAR-10 images, because of the larger number of classes to separate. We therefore used two different network architectures: a simple convolutional neural network with two pairs of convolutional and max pooling layers, followed by a fully connected layer with a final 10-way softmax layer, for CIFAR-10, and a fully-convolutional network that consists only of convolutional layers, with a final 100-way softmax layer, for CIFAR-100. We adopt the term "pre-trained" or "baseline" network to refer to any network that is trained on undistorted images.

The architecture of our simple convolutional network, which serves as our baseline model for the CIFAR-10 dataset, is very similar to the original LeNet [32]. The first convolutional layer has 32 filters and a kernel size of 3, while the second one has 16 filters with the same kernel size. Both max pooling layers have a kernel size of 2. The fully connected dense layer has 128 neurons, with a dropout probability of 0.5. ReLU nonlinearities [38] are adopted after every batch normalization [22] operation, and also as activation functions for the classifier neurons in the fully connected dense layer.

Our version of the fully-convolutional network is based on the All-Conv Net proposed by Springenberg et al. [52], with the addition of batch normalization units after each convolutional layer, and is used as our baseline model for the CIFAR-100 dataset. A summary of both network architectures can be observed in Figure 6.1.
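The feature map sizes reported in Figure 6.1 can be checked with the usual output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The sketch below applies it to the (kernel, stride, padding) values read off the figure; the helper names are hypothetical.

```python
def out_size(n, k, s, p):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# (kernel, stride, padding) of the conv and max-pool layers of the Simple Conv Net.
SIMPLE_CONV_NET = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
# (kernel, stride, padding) of the nine conv layers of the All-Conv Net.
ALL_CONV_NET = [(3, 1, 1), (3, 1, 1), (3, 2, 1),
                (3, 1, 1), (3, 1, 1), (3, 2, 1),
                (3, 1, 0), (1, 1, 0), (1, 1, 0)]

def trace(layers, n=32):
    """Spatial size after each layer, starting from an n x n input."""
    sizes = []
    for k, s, p in layers:
        n = out_size(n, k, s, p)
        sizes.append(n)
    return sizes

simple_sizes = trace(SIMPLE_CONV_NET)    # [32, 16, 16, 8]
all_conv_sizes = trace(ALL_CONV_NET)     # [32, 32, 16, 16, 16, 8, 6, 6, 6]
dense_inputs = simple_sizes[-1] ** 2 * 16     # 8 * 8 * 16 = 1024
last_conv_features = all_conv_sizes[-1] ** 2 * 100  # 6 * 6 * 100 = 3600
```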

Figure 6.1(a) Simple Conv Net. Layers, from input to output: 32x32 RGB image; 3x3-conv-32-1-1; 32-batch norm; 2x2-maxpool-2-0; 3x3-conv-16-1-1; 16-batch norm; 2x2-maxpool-2-0; 128-fully connected; 0.5-drop out; 10-fully connected; 10-softmax. Feature map sizes shown along the network: 32x32x32, 16x16x32, 16x16x16, 8x8x16, 128x1, 10x1.

Figure 6.1(b) All-Conv Net. Layers, from input to output: 32x32 RGB image; 0.2-drop out; 3x3-conv-96-1-1; 96-batch norm; 3x3-conv-96-1-1; 96-batch norm; 3x3-conv-96-2-1; 96-batch norm; 0.5-drop out; 3x3-conv-192-1-1; 192-batch norm; 3x3-conv-192-1-1; 192-batch norm; 3x3-conv-192-2-1; 192-batch norm; 0.5-drop out; 3x3-conv-192-1-0; 192-batch norm; 1x1-conv-192-1-0; 192-batch norm; 1x1-conv-100-1-0; 192-batch norm; 6x6 global avg. pool; 100-softmax. Feature map sizes shown along the network: 32x32x96, 16x16x96, 16x16x192, 8x8x192, 6x6x192, 6x6x100, 1x1x100.

Figure 6.1 Network architectures for our baseline models. Convolutional layers are parameterized by kxk-conv-d-s-p, where kxk is the spatial extent of the filter, d is the number of output filters in a layer, s represents the filter stride and p indicates the zero-padding added to both sides of the input. Max-pooling layers are parameterized as kxk-maxpool-s-p, where s is the spatial stride and p indicates the implicit zero padding to be added on both sides. Batch normalization layers are parameterized by d-bn, where d is the number of features in the layer. Dropout layers are parameterized by pr-dp, where pr is the dropout probability value. Fully connected linear layers are parameterized by d-fc, where d represents the dimensionality of the output space. (a) Simple convolutional network for CIFAR-10. (b) All-Conv Net for CIFAR-100.

For what concerns the training details of the baseline models, we adopted the cross entropy loss for both models, minimized through the Adam optimizer with the standard hyperparameters listed in the Adam paper [26]. Model fitting was done in a validation-based early stopping setting, adopting a patience hyperparameter of 15 epochs.
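Validation-based early stopping with a patience of 15 epochs can be sketched as below. The class name is hypothetical, and monitoring the validation loss (rather than, say, the validation accuracy) is an assumption.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Usage with a short patience so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3)
flags = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.92, 0.91]]
```

In a training loop, `step` would be called once per epoch after evaluating on the validation split, and the weights from the best epoch would typically be restored when it returns True.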

Each input image was also preprocessed according to one of two separate techniques, depending on the type of network, as this proved to be the best way to demonstrate the effectiveness of our methodology:


• Simple Conv Net (CIFAR-10): scaling of each pixel value to the [0, 1] range, by dividing each pixel by 255;

• All-Conv Net (CIFAR-100): mean subtraction first, which consists in subtracting the mean of every individual feature in the data, followed by a normalization step, performed by dividing each zero-centered dimension by its standard deviation.
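The two preprocessing schemes above can be sketched as follows. Function names are hypothetical; computing the statistics over the first (sample) axis and adding a small constant to avoid division by zero are assumptions.

```python
import numpy as np

def scale_to_unit(images):
    """Map 8-bit pixel values to the [0, 1] range."""
    return images.astype(np.float64) / 255.0

def standardize(images, mean=None, std=None):
    """Zero-centre every feature, then divide it by its standard deviation.

    When `mean` and `std` are not given they are estimated from `images`;
    passing the training-set statistics lets other splits reuse them.
    """
    x = images.astype(np.float64)
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    return (x - mean) / (std + 1e-7), mean, std

rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(100, 4, 4, 3))
unit = scale_to_unit(train)
z, mean, std = standardize(train)
```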

6.1.4 Clustering

The last experimental setting is related to the clustering method, mentioned in Section 4.4. We decided to experiment with different approaches, varying which features to use to cluster each image and which distance metric to adopt to compare the data points. For what concerns the features, we decided to cluster either directly on the pixels, where each feature is one of the 32x32 pixels of the image, or on the one-dimensional collapsed version of the baselines' feature map activations, at the output of the low-level convolutional layers. With respect to the distance metric, only the Euclidean distance [1] was considered when features were represented by image pixels, whereas the Hamming distance [57] was also tested for the convolutional feature maps case. When the Hamming distance was used, feature map activations were first converted into binary vectors, setting each element to 1 whenever the corresponding value was greater than 0, and to 0 otherwise.
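The binarization and Hamming distance described above amount to the following sketch, shown on two tiny made-up feature maps; the helper names are illustrative.

```python
import numpy as np

def binarize(activations):
    """1 where the activation is greater than 0, else 0."""
    return (np.asarray(activations) > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Number of positions at which two binary vectors differ."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

# Two made-up 2x2 feature maps, collapsed to one dimension after binarization.
fmap_a = np.array([[0.0, 1.2], [0.3, 0.0]])
fmap_b = np.array([[0.5, 0.9], [0.0, 0.0]])
vec_a = binarize(fmap_a).ravel()  # [0, 1, 1, 0]
vec_b = binarize(fmap_b).ravel()  # [1, 1, 0, 0]
distance = hamming_distance(vec_a, vec_b)  # 2
```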

6.2 Results and Discussion

This section, as mentioned at the beginning of the chapter, is one of the core sections of our entire work, since it illustrates, through heatmaps and plots, the actual impact of our methodology on the baseline model trained on undistorted images. The section starts with a demonstration of how noise acts on the convolutional filters of the baseline model when image distortion is present in the input data. Next, we demonstrate the potential of the swapping methodology, which we introduced to show how the intuition behind the selective fine-tuning approach was conceived. We then present the effects of our selective filter fine-tuning approach, first in terms of classification performance improvement, then in terms of increased robustness of convolutional filters to noise. Finally, we show that our non associative approach is still a valid method to obtain the ranking of the filters most susceptible to image distortion, for settings in which obtaining the (clean, noisy) pair of the same image is not feasible, proving that it is able to find a reasonable sample of the filters that were found through the more accurate associative approach.

6.2.1 Noise Impact on Baseline Models

Before focusing our attention on the effectiveness of the proposed transfer learning technique, it is important to establish whether, for a network trained on undistorted images, only some of the convolutional filters in the network are susceptible to noise or blur in the input image. As we can see in Figure 6.2, certain convolutional filters are far more susceptible to input distortions than others. Considering for example only the first convolutional layer in our baseline model trained on CIFAR-10 data, and applying our associative method described in Section 4.2.1, we can see how some activations always tend to be ranked higher by our voting technique. This demonstrates that their corresponding convolutional filters are far more sensitive to input distortion than others. Restoring the activations of only the filters that are more susceptible to input distortions can reduce the time and computational resources involved in enhancing DNN robustness to such distortions.
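As an illustration of how a Borda-style vote aggregates per-image rankings into counts such as those plotted in Figure 6.2, consider the sketch below. The point scheme (n - 1 points for the most affected filter down to 0 for the least) and the per-image sensitivity scores are assumptions made for the example; the actual procedure is the one described in Section 4.2.

```python
import numpy as np

def borda_counts(scores):
    """Aggregate per-image filter scores into one Borda count per filter.

    scores[i, f] is the sensitivity of filter f on image i (higher means
    more affected). Each image awards n_filters - 1 points to its most
    affected filter, n_filters - 2 to the next one, and so on down to 0.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # argsort of argsort converts scores into per-image ranks (0 = least affected).
    points = scores.argsort(axis=1).argsort(axis=1)
    return points.sum(axis=0)

# Made-up sensitivity scores for 3 images and 3 filters.
scores = np.array([[0.9, 0.1, 0.5],
                   [0.8, 0.2, 0.3],
                   [0.4, 0.1, 0.7]])
counts = borda_counts(scores)   # [5, 0, 4]
ranking = np.argsort(-counts)   # filters from most to least susceptible: [0, 2, 1]
```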

Interestingly enough, this set of convolutional filters vulnerable to input distortion seems to be independent of the type of distortion. Looking at the bar plots in Figure 6.2, we can see that the filters most affected by noise in the AWGN case tend to be the same filters in the blurring case, with the exception of a few elements. This is an interesting result: since the set of filters to fine-tune is shared between the two types of distortion, by simply fine-tuning the baseline models with distorted images from one of the two, the network could end up being robust to the other type of distortion too. This investigation could be the subject of future work.

6.2.2 Activation Maps Swapping

Following the results of the previous sub-section, which showed that some filters are more susceptible to image distortions than others, we tried to correct the outputs of the most susceptible convolutional filters by swapping their feature map activations with the ones they produced when given the clean version of the same image as input. This approach, introduced in Section 4.3.1, served as a proof of concept for the fact that correcting the image representation towards the one produced by the clean samples improves the model's classification performance.
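The swapping step can be sketched as below: the feature maps of the highest-ranked filters are overwritten with those obtained from the clean image, leaving the others untouched. Array shapes and the function name are assumptions made for illustration.

```python
import numpy as np

def swap_feature_maps(noisy_maps, clean_maps, ranking, n_swap):
    """Replace the n_swap highest-ranked channels of noisy_maps with clean ones.

    Both inputs have shape (channels, height, width); `ranking` lists the
    channel indices from most to least susceptible to distortion.
    """
    swapped = noisy_maps.copy()
    idx = np.asarray(ranking[:n_swap], dtype=int)
    swapped[idx] = clean_maps[idx]
    return swapped

# Toy example: 4 channels, clean activations all 0, noisy activations all 1.
clean = np.zeros((4, 2, 2))
noisy = np.ones((4, 2, 2))
ranking = [2, 0, 3, 1]
half_swapped = swap_feature_maps(noisy, clean, ranking, n_swap=2)
```

In an experiment of this kind, the swapped activations would then be fed to the remaining layers of the network, and the accuracy recorded for each value of `n_swap`.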

In order to assess such improvement, we took the first three layers in the All-Conv model which, as shown in Figure 6.1b, are composed of 96 3x3 convolutional filters. For each layer, we computed the filter rankings and then proceeded to swap the feature maps in all 3 layers, one by one, following the order imposed by the ranking. In Figure 6.3 we show the results of the proposed technique when the model is tested on images distorted with AWGN and Gaussian blur. In both cases the classification performance reaches that of the clean pictures after swapping around a quarter of the filters, increasing steeply towards the limit imposed by the accuracy on the clean samples, especially with the first swapped feature maps.

[Figure 6.2: four bar plots of Borda count (y-axis) against feature map index (x-axis). Panels: Convolutional layer 1 - AWGN with σ = 15 (CIFAR-10); Convolutional layer 1 - AWGN with σ = 15 (CIFAR-100); Convolutional layer 1 - Blur with σ = 1.25 (CIFAR-10); Convolutional layer 1 - Blur with σ = 1.25 (CIFAR-100).]

Figure 6.2 Distortion susceptibility of convolutional filters in the first convolutional layer of both baseline models, when tested on training images from CIFAR-10 and CIFAR-100 respectively. Even though the Borda counts are slightly different between the two types of distortion, it is clear that the ranking of the most susceptible filters tends to be independent of the type of distortion applied. In fact, considering only the 25% most sensitive feature maps of each depicted convolutional layer, 6 of the 8 selected filters match in the CIFAR-10 case, and 19 of 24 in the CIFAR-100 case.

Another interesting fact, which we also introduced in Section 4.3.1, is that when the majority of the activation maps have been swapped, the classification performance may exceed the one obtained on the clean samples. This may be due to some regularization introduced by the distortion in the input which, if present in small amounts, improves the model's generalization capabilities.

6.2.3 Filters Fine-Tuning on Target Dataset

In this section, we evaluate the proposed approach of fine-tuning baseline models with distortion-affected inputs, for the two datasets and architectures mentioned in Section 6.1. For the reproducibility of the results of the experiments, it is important to note that the classification performance is evaluated independently for each type of distortion. Furthermore, for the baseline trained on CIFAR-10, all shown results are obtained with the fully connected layer unchanged, meaning that no fine-tuning was performed on the classifier's weights. We did so because we wanted to evaluate the impact of our technique on the representation learning capabilities of the network, rather than just measuring the improvement in terms of accuracy. Nonetheless, the reader should be aware that, when performing transfer learning, it may also be necessary to retrain a new classifier from scratch, due to differences between source and target tasks.

Figure 6.3 The two graphs represent the model accuracy when tested on distorted data (Additive White Gaussian Noise on top, Gaussian blur at the bottom) with an increasing number of feature maps swapped with the ones coming from the same undistorted image. The green line represents the model performance on clean images. As expected, when all 96 feature maps are swapped, the two lines meet.

Improving Classification Accuracy of the Baseline Model

One of the major objectives of our proposed transfer learning technique is to improve the classification performance of the baseline model when tested on noisy samples. Figure 6.4 summarizes the results of the evaluation of this technique, when only 25% of the convolutional filters of each of the corresponding convolutional layers are fine-tuned. For the CIFAR-10 baseline, convolutional filters are fine-tuned in both convolutional layers simultaneously, while in the All-Conv Net, trained on CIFAR-100, only the early convolutional layers are corrected, precisely the first three. This is because, as confirmed by [3], the best performance is achieved by correcting filters in the early convolutional layers of the network. In fact, as we go deeper in the network, the accuracy obtained by correcting a fixed percentage of convolutional filters diminishes, which indicates that, in deeper layers, all the convolutional filters become more or less equally susceptible to distortion in the input data.
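One simple way to restrict fine-tuning to a subset of filters is to zero the gradient of every filter outside the selected set before applying the update. The sketch below does this for a single plain gradient step in NumPy; the function name, the weight layout and the use of plain gradient descent (instead of the optimizer actually used) are assumptions made for illustration.

```python
import numpy as np

def masked_sgd_step(weights, grads, selected, lr=0.01):
    """Apply one gradient step only to the filters listed in `selected`.

    `weights` and `grads` have shape (out_channels, in_channels, k, k);
    every filter whose index is not in `selected` is left unchanged.
    """
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[list(selected)] = True
    # Broadcast the per-filter mask over the remaining three axes.
    return weights - lr * grads * mask[:, None, None, None]

# Toy layer with 8 filters, of which 25% (filters 1 and 5) are fine-tuned.
weights = np.ones((8, 3, 3, 3))
grads = np.ones_like(weights)
updated = masked_sgd_step(weights, grads, selected=[1, 5], lr=0.1)
```

With a stateful optimizer the same mask would also have to be applied to the optimizer's internal statistics, otherwise momentum terms could still move the frozen filters.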

The results in Figure 6.4, which plots classification accuracy as a function of the number of noisy training points used to fine-tune the baseline models, clearly demonstrate that, for small training set sizes, fine-tuning only the most affected convolutional filters yields a better classification performance than fine-tuning all the filters in the selected layers, or fine-tuning only the filters that are least susceptible to input distortion, confirming our hypotheses. For larger dataset sizes, instead, fine-tuning the entire set of filters in the considered layers proves to be more convenient, as expected.

These conclusions are supported by the graphs that consider the "central" level of distortion (AWGN with σ = 15 and Blur with σ = 1.25), represented in Figure 6.4: independently of the type of noise and network architecture, a moderate level of distortion, such as the ones considered in these configurations, is efficiently handled by our technique. In fact, for a limited amount of noisy training samples used to fine-tune the baseline networks, the "most" configuration is able to outperform the other two, while considerably reducing the computational requirements with respect to fine-tuning all the convolutional filters in the layer. When a larger set of noisy samples is available, instead, fine-tuning all the convolutional filters proves to be the most effective solution.

Figure 6.4 Fine-tuning effects on the classification performance on noisy inputs, as a function of the number of distorted training images used to perform the fine-tuning. For each plot, three different configurations were considered: (1) most: fine-tuning is performed only on the 25% of the layer's convolutional filters most susceptible to image distortion; (2) least: fine-tuning is performed only on the 25% of the layer's convolutional filters least susceptible to image distortion; (3) all: fine-tuning is performed on all convolutional filters of the layer, independently of their susceptibility to image distortion. For the plots in the first two columns, fine-tuning was done on the baseline model trained on CIFAR-10 undistorted images, correcting the convolutional filters from both convolutional layers of the network. For the plots in the right-most two columns, fine-tuning was performed on the baseline model trained on CIFAR-100 pristine samples, correcting the convolutional filters only from the first three convolutional layers of the network. In the last row we present the fine-tuning performed with the least amount of distortion, which barely produces any effect.

For what concerns the other two levels of distortion, listed in Section 6.1.2, other considerations need to be made. The bottom of Figure 6.4 illustrates all the scenarios that involve the smallest amount of distortion that we considered. In these settings, it is evident that our proposed technique is of little use, if not detrimental. For such limited levels of distortion, fine-tuning the baseline models with distorted samples is not necessary, because the dataset shift problem, first mentioned in Chapter 1, is essentially nonexistent, as the difference between source and target images is fundamentally absent. Looking at the specified plots, we see that the improvement is in the order of an ideal 2% margin, which would not justify the need for such a transfer learning process. On the other hand, scenarios that involve a substantial level of distortion (AWGN with σ = 25 and Blur with σ = 2.25) present totally different issues. When the input samples are seriously corrupted, two outcomes occur:

1. The "all" configuration consistently outperforms the other two configurations (e.g. CIFAR-100 with blur distortion);

2. The "all" configuration needs fewer noisy training samples to surpass the "most" configuration, with respect to the "central" level of image distortion that we previously discussed (e.g. CIFAR-100 with AWGN distortion).

What the results of these configurations clearly imply is that the higher the level of distortion, the greater the number of parameters that need to be corrected to account for such conspicuous corruption. Limiting the number of filters to retrain would then be too strong a constraint for the baseline model to adapt to classifying such distorted samples. However, it is important to point out that the downsides of the proposed approach at these high levels of distortion are relevant because of the input data resolution. Since the images in the CIFAR datasets are so small, the effect of the corruption on the images is substantial. On higher resolution samples (e.g. ImageNet), considerably larger corruption intensities would be needed to incur the drawbacks that we have just presented.

To further improve the classification performance of the baseline model, it is important to note that fine-tuning a subset of the most affected filters from multiple layers proved to be more successful than fine-tuning only a single layer of the network at a time. This result is intuitive, as a larger number of parameters is fine-tuned, enabling a potentially greater level of correction in the network. Nonetheless, the number of parameters is still limited with respect to fine-tuning all the convolutional filters of the considered layers, causing our proposed technique to outperform the usual fine-tuning of the entire layer parameters in the aforementioned circumstances, while also limiting the computational requirements of the fine-tuning.

In conclusion, it is also relevant to point out that, as mentioned at the beginning of the section, we decided to fine-tune only one fourth of the convolutional filters present in each convolutional layer. This decision was not based on a thorough assessment of how many filters to fine-tune per layer, and we leave this task as a subject for future work. A starting point could be to fine-tune only the set of convolutional filters whose EMD metric is above a certain threshold. By doing so, we would make sure to fine-tune only the filters that account for the greatest degradation in performance, and not include those that happen to be in the considered 25% but do not vary as much as the ones at the top of the ranking.


Nonetheless, it is clear that our convolutional filter ranking methodology, detailed in Section 4.2, always proves to be beneficial. In fact, fine-tuning the top 25% of the convolutional filters most susceptible to input data distortion, according to our ranking, is consistently better than retraining the bottom 25%. Evidence for this remark can be observed in every configuration depicted in Figure 6.4.
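The threshold-based selection suggested above as future work can be contrasted with the fixed 25% quota in a few lines. The EMD values are made up for the example and the function names are hypothetical.

```python
import numpy as np

def select_top_fraction(emd_per_filter, fraction=0.25):
    """Indices of the `fraction` of filters with the largest EMD."""
    emd = np.asarray(emd_per_filter, dtype=np.float64)
    k = int(round(fraction * emd.size))
    return [int(i) for i in np.argsort(-emd)[:k]]

def select_by_threshold(emd_per_filter, threshold):
    """Indices of filters whose EMD exceeds `threshold`, most affected first."""
    emd = np.asarray(emd_per_filter, dtype=np.float64)
    return [int(i) for i in np.argsort(-emd) if emd[i] > threshold]

# Made-up EMD values for a layer with 8 filters: only filter 0 varies a lot.
emd = [0.90, 0.05, 0.12, 0.10, 0.07, 0.08, 0.06, 0.09]
quota = select_top_fraction(emd)                      # [0, 2]
above = select_by_threshold(emd, threshold=0.5)       # [0]
```

Here the fixed quota also selects filter 2, whose value is close to that of the unaffected filters, whereas the threshold keeps only filter 0.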

Reducing the Covariate Shift Effects

As the last intrinsic result of our transfer learning technique, we recall that one of the fundamental objectives of our method was to overcome the problem of covariate shift, first mentioned in Section 1.2. This phenomenon was said to be responsible for the degradation in performance of our baseline models when tested on distorted images. The reliable performance of the fine-tuned models across several noise intensities suggests that the fine-tuned networks learned to be invariant to such noise.

Inspired by [56], we replicated their empirical evaluation methodology to assess such feature invariance to input distortion. In detail, we look at the similarity in activations of different layers of both the baseline and the fine-tuned All-Conv Net model, when sharp and noisy versions of the same image are given as input to the network. Specifically, we consider the feature maps at the output of the first three convolutional layers in the All-Conv networks, since they are the only convolutional layers that were subject to our fine-tuning approach. We convert the feature vector at every location into a binary string representing whether each feature channel had a positive or zero response.

In Figure 6.5, we visualize the Hamming distances between corresponding binary strings produced from the sharp and AWGN-distorted versions of the same example image, for the baseline All-Conv model and for the models fine-tuned on images affected by AWGN (σ = 15), both when the fine-tuning was done on all convolutional filters per layer and when it was done only on the most affected ones. What we can assess from the heatmaps in the figure is that the baseline model produces different activations on the sharp and noisy inputs, at all layers. In contrast, the fine-tuned models are able to achieve a reasonable amount of noise invariance, with low distances between sharp and distorted activations in all three fine-tuned layers, with the model fine-tuned only on the most affected convolutional filters being slightly more invariant to image distortion than the one where all the filters were fine-tuned. This result is in line with our hypotheses and supports the reasoning behind our assumptions.
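A distance map of this kind can be sketched as follows. Dividing by the number of channels, so that each location lies in [0, 1], is an assumption suggested by the colour scale of Figure 6.5; the function name is hypothetical.

```python
import numpy as np

def hamming_map(sharp_maps, noisy_maps):
    """Per-location fraction of channels whose binarized response differs.

    Inputs have shape (channels, height, width); the output has shape
    (height, width) with values in [0, 1].
    """
    sharp_bits = np.asarray(sharp_maps) > 0
    noisy_bits = np.asarray(noisy_maps) > 0
    return (sharp_bits != noisy_bits).mean(axis=0)

# Toy activations: 2 channels over a 2x2 spatial grid.
sharp = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 1.0], [0.0, 0.0]]])
noisy = np.array([[[1.0, 1.0], [0.0, 1.0]],
                  [[0.0, 1.0], [0.0, 0.0]]])
dist = hamming_map(sharp, noisy)   # [[0.5, 0.5], [0.0, 0.0]]
total = float(dist.sum())          # 1.0, a single summary number for the whole map
```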

Another relevant way of assessing whether the model fine-tuned through our selective filter-level fine-tuning approach has actually achieved the so-called feature invariance is to look at the two-dimensional t-SNE embedding of the features of the fine-tuned network model, precisely as we did in Section 4.1, where the embedded features were taken from the baseline model, trained on pristine images from CIFAR-100. In this case, we look at the 3600-dimensional features coming from the last convolutional layer of our All-Conv Net model, fine-tuned respectively with 5000 images distorted through Gaussian blur with σ = 1.25 (Figure 6.6 - Top row) or with 5000 images distorted through AWGN with σ = 15 (Figure 6.6 - Bottom row). What we demonstrate in Figure 6.6, unlike the clustering of classes pictured in Figure 4.2, is that the features coming from the model where only the filters most susceptible to input data distortion were fine-tuned are discriminative enough to generate a compact clustering of images from the same class and also provide a good separation between clusters of other classes, even as distortion severity increases. In fact, even though the distortion severities depicted also include values of standard deviation that were not considered in the images used for fine-tuning the model, the latent features learned by the models are still able to properly handle this difference in terms of distortion intensity. This result is very promising, and it implies that a model fine-tuned through our selective filter-level fine-tuning could actually achieve a reasonably good level of feature invariance when tested on images distorted with distortion intensities different from the ones on which it was fine-tuned.

[Figure 6.5: 3x3 grid of heat-maps with a colour scale from 0.00 to 1.00; input images: Sharp and Gaussian (σ = 15). Element-wise sums per panel:
All-Conv Net baseline: conv1 - (114.417), conv2 - (172.021), conv3 - (68.01)
All-Conv Net ft_all (1-2-3): conv1 - (106.865), conv2 - (147.094), conv3 - (39.51)
All-Conv Net ft_most (1-2-3): conv1 - (97.615), conv2 - (132.865), conv3 - (37.698)]

Figure 6.5 Disparity between corresponding layer activations on sharp and noisy versions of an example image. Each heat-map represents the Hamming distance between binarized feature vectors (i.e., whether each channel is positive or zero) at corresponding locations in the sharp and Gaussian distorted inputs. We visualize these distance maps for the first three convolutional layers in the All-Conv Net architecture, comparing three different models: Top: Baseline, trained on undistorted images; Middle: Network where all the filters in the first three convolutional layers were fine-tuned with 5000 Gaussian distorted training images from CIFAR-100; Bottom: Network where only the top 25% of the convolutional filters most susceptible to Gaussian distortion in the first three convolutional layers were fine-tuned with 5000 Gaussian distorted training images from CIFAR-100. The numbers between round brackets indicate the element-wise sum of the corresponding heat-map: the higher the number, the larger the disparity between the matching layer activations on the sharp and noisy versions of the example image. We see that the model where only the filters most susceptible to Gaussian distortion were fine-tuned produces feature activations that are relatively invariant to the presence of Gaussian noise in the input image.

[Figure 6.6: two rows of four t-SNE scatter plots. Top row panels: Original; Blur σ = 0.25; Blur σ = 1.25; Blur σ = 2.25. Bottom row panels: Original; AWGN σ = 5; AWGN σ = 15; AWGN σ = 25.]

Figure 6.6 Two-dimensional t-SNE embedding of the last layer features of the fine-tuned All-Conv net, visualized for original (undistorted), blurred and noise affected images of 10 classes from the CIFAR-100 test set, with each color representing a separate class and distortion severity increasing from left to right. Each point in the embedding represents an image in the 10 class subset, with 100 images per class. Top row: Embedding for Gaussian blur affected images. Bottom row: Embedding for AWGN affected images.

6.2.4 Non Associative Ranking

To complete our discussion of the evaluation of the proposed methodology, we evaluate the non associative technique proposed in Section 4.4. As previously stated, this method is meant to provide a solution for the case in which the (clean, noisy) image pair, for each training image, is not available. We presented a way to overcome this problem, relying on the so-called exemplars.

Table 6.1 shows, for the first three convolutional layers in the All-Conv Net baseline model, trained on CIFAR-100, how many filters identified by the different configurations of our non associative approach actually match the set of filters identified by the associative technique.

Distance    Layer 1          Layer 2          Layer 3
            Pixels   FTM     Pixels   FTM     Pixels   FTM
Euclidean   18/24    14/24   13/24    13/24   19/24    19/24
Hamming     -        13/24   -        11/24   -        17/24

Table 6.1 Comparison of the number of matching convolutional filters, per layer, between the associative ranking and several configurations of the non associative ones. All the non associative rankings, and the associative one used as "ground truth", are based on the All-Conv Net baseline model, trained on undistorted images from CIFAR-100. All three convolutional layers have 96 filters each, so the top 25% of each layer contains 24 filters. The noisy images used to compare clean and noisy activations were perturbed with AWGN with distortion severity σ = 15 (comparable results are obtained with blurring distortion). Pixels indicates that image pixels were used as features for the clustering method, whereas FTM indicates that the collapsed version of the baseline feature map activations, at the output of the corresponding convolutional layer, was used.

As the table shows, the number of matching filters is promising, with the configuration that uses image pixels being the one that, overall, found the ranking closest to the associative one. It is important to note that, even though such lists are produced as rankings in the first place, the actual ordering is not relevant, because each of them simply represents a set of convolutional filters that need to be fine-tuned. Because of this, the fact that a given index comes later in the non associative ranking than in the associative one has no impact on the fine-tuning performance. Therefore, the higher the number of matching indices, the better the fine-tuning. This result suggests that fine-tuning the convolutional filters indicated by this technique will achieve approximately the same performance as our associative approach, supporting the applicability of our methodology to every empirical setting, associative or not.
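The comparison reported in Table 6.1 amounts to a set intersection between the top-ranked filters of two rankings. A minimal sketch of that count follows; the function name `count_matching_filters` and the toy rankings are illustrative assumptions, not the code used for the experiments.

```python
def count_matching_filters(ranking_a, ranking_b, fraction=0.25):
    """Count the filters shared by the top `fraction` of two rankings.

    Both rankings are sequences of filter indices, most susceptible first.
    The order inside the top set is ignored, as in the discussion above.
    """
    if len(ranking_a) != len(ranking_b):
        raise ValueError("rankings must cover the same number of filters")
    top_k = int(len(ranking_a) * fraction)  # e.g. 96 filters -> 24
    shared = set(ranking_a[:top_k]) & set(ranking_b[:top_k])
    return len(shared), top_k


# Toy example with 8 filters (hypothetical rankings): top 25% -> 2 filters.
associative = [5, 2, 7, 0, 1, 3, 4, 6]
non_associative = [2, 6, 5, 1, 0, 3, 4, 7]
matches, top_k = count_matching_filters(associative, non_associative)
print(f"{matches}/{top_k}")
```

Because only set membership is compared, swapping two filters inside the top 25% of either ranking leaves the count unchanged.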


Chapter 7

Conclusion

In this work, we showed that Convolutional Neural Networks trained on pristine images perform poorly when employed in settings where image distortions are present. We tested different network architectures against two common types of image distortion, Gaussian Blur and Additive White Gaussian Noise, applied with different intensities. We observed that some of the filters in each convolutional layer of the Convolutional Neural Networks are far more susceptible to input distortions than others, and showed that this set of filters is nearly independent of the distortion applied to the input data.

Following these findings, we proposed a novel technique to measure the susceptibility of convolutional filters to input data distortion and used this procedure to identify the filters that contribute the most to the drop in classification performance that occurs when a Convolutional Neural Network, trained on pristine images, is tested on distorted ones. We demonstrated that our assessment technique is also applicable to situations in which the correspondence between clean and noisy versions of the same image is not available, providing a solution to scenarios that related works in the literature do not address.

We designed a new way to perform transfer learning that moves on from the usual fine-tuning of the entire set of convolutional filters in a layer of the network and improves the robustness of the network against image distortions. Fine-tuning only the most distortion-susceptible convolutional filters of the model, while leaving the rest of the pre-trained filters in the network unchanged, proved to be a good solution in terms of both classification performance and number of parameters to train. Deep Neural Networks trained with this approach outperform the ones fine-tuned with the traditional layer-level technique when labeled data in the noisy domain is limited, and obtain comparable performance when training data in this setting is widely available, while still maintaining a lower number of trainable parameters. This makes fine-tuning of large networks a feasible task, with the number of trainable parameters being lower than in competing solutions, such as traditional fine-tuning or ad hoc correction units that are added at the output of the most distortion-susceptible filters in each convolutional layer. Moreover, we showed that this fine-tuning technique accomplishes the task of learning features that are invariant to distorted input data.

These findings can be employed in many different real-world applications, specifically in the following situations:

• When there is a small number of labeled samples in the Target Domain (distorted images), this technique proved to be very effective, performing better than traditional fine-tuning;

• When there are enough samples in the Target Domain, but the computational requirements to fine-tune the entire Convolutional Neural Network are too high, this approach proved able to reach comparable performance with fewer trainable parameters, making the training procedure feasible with fewer resources.

Some additional benefits of this approach are the following: it is blind to the kind and intensity of input distortion, as long as some distorted samples are provided; and it does not need (pristine, distorted) image pairs to assess the distortion impact on the convolutional filters.

There are many directions for future developments:

• We always fine-tuned only the top 25% of the most affected filters, but this may not always be the most effective or efficient choice. Depending on the situation, it may be better to train more or fewer filters. Finding a good way to decide how many filters to fine-tune for each layer could yield better performance and more computational savings.

• An interesting finding, which is probably worth investigating, is that the set of vulnerable filters in a layer seems to be rather independent of the distortion type and intensity. Exploring this hypothesis could shed more light on the impact that different distortion types have on Convolutional Neural Networks.

• One last dimension we did not explore, but which could be worth investigating, is the fine-tuning procedure. During our experiments, we only tested two alternatives: applying our approach to a single layer, or to all layers at once. What could be tested instead is a cascading procedure in which, for each layer, starting from the input one, we compute the set of most vulnerable filters and fine-tune them before proceeding with the following layers. This way the distortion impact assessment always shows the effects of distortion on that specific layer, since the previous ones have already been fine-tuned.
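The cascading procedure suggested in the last item can be sketched as a simple loop over layers. This is only an outline under stated assumptions: `cascade_fine_tune`, `rank_filters` and `fine_tune` are hypothetical names, the two callables stand in for the ranking and training code, and keeping at least one filter per layer is a choice made here for illustration.

```python
def cascade_fine_tune(layer_names, rank_filters, fine_tune, fraction=0.25):
    """Rank and fine-tune one layer at a time, from input to output.

    rank_filters(layer) returns filter indices, most susceptible first,
    measured on the model in its current (partly fine-tuned) state.
    fine_tune(layer, indices) updates the model in place.
    Both callables are placeholders supplied by the caller.
    """
    selected = {}
    for layer in layer_names:
        ranking = rank_filters(layer)
        top_k = max(1, int(len(ranking) * fraction))  # assumption: keep >= 1
        selected[layer] = sorted(ranking[:top_k])
        fine_tune(layer, selected[layer])
    return selected


# Stub demonstration: record the call order instead of training anything.
calls = []
toy_rankings = {"conv1": [3, 0, 2, 1], "conv2": [1, 2, 3, 0]}


def rank_stub(layer):
    calls.append(("rank", layer))
    return toy_rankings[layer]


def tune_stub(layer, indices):
    calls.append(("tune", layer, tuple(indices)))


chosen = cascade_fine_tune(["conv1", "conv2"], rank_stub, tune_stub)
print(chosen)
print(calls)
```

The recorded call order shows that each layer is ranked only after the previous one has been fine-tuned, which is the property the cascading idea relies on.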


While presenting a novel approach, this thesis leaves many open questions that could be further investigated in future work. Applying transfer learning to pre-trained Deep Neural Networks proved to be a very good way to bring the power of deep learning models closer to users with limited computational resources or data. Improving these techniques can have an important impact on real-world applications, generating new use cases and opportunities for users.

Page 81: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

Bibliography

[1] Anton, H. (1993). Elementary Linear Algebra. Wiley & Sons.

[2] Bauckhage, C. (2015). Numpy / scipy recipes for data science: k-medoids clustering.

[3] Borkar, T. S. and Karam, L. J. (2017). Deepcorrect: Correcting DNN models againstimage distortions. CoRR, abs/1705.02406.

[4] Chollet, F. et al. (2015). Keras. https://keras.io.

[5] Cross Validated User (antike) (2018). What is the advantages of wasser-stein metric compared to kullback-leibler divergence? Cross Validated.URL:https://stats.stackexchange.com/q/351153 (version: 2018-06-13).

[6] Dantzig, G. B. (1951). Application of the simplex method to a transportation problem. InIn Activity Analysis of Production and Allocation, pages 359–373. John Wiley and Sons.

[7] Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. (2009). Imagenet: Alarge-scale hierarchical image database. In 2009 IEEE Conference on Computer Visionand Pattern Recognition, pages 248–255.

[8] Dodge, S. F. and Karam, L. J. (2016). Understanding how image quality affects deepneural networks. CoRR, abs/1604.04004.

[9] Dodge, S. F. and Karam, L. J. (2018). Quality robust mixtures of deep neural networks.IEEE Transactions on Image Processing, 27(11):5553–5562.

[10] Emerson, P. (2016). From Majority Rule to Inclusive Politics. Springer InternationalPublishing.

[11] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010).The pascal visual object classes (voc) challenge. International Journal of Computer Vision,88(2):303–338.

[12] Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points.Science, 315(5814):972–976.

[13] Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchiesfor accurate object detection and semantic segmentation. CoRR, abs/1311.2524.

[14] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.http://www.deeplearningbook.org.

Page 82: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

62 Bibliography

[15] Google. Deep Learning VM - Google Cloud Platform. https://console.cloud.google.com/marketplace/details/click-to-deploy-images/deeplearning?_ga=2.269005809.-1411201463.1547487403. [Online; accessed 25-March-2019].

[16] Google. Google Colab. https://colab.research.google.com. [Online; accessed 25-March-2019].

[17] Google. Google Colab: Frequently Asked Questions. https://research.google.com/colaboratory/faq.html. [Online; accessed 25-March-2019].

[18] Han, J., Kamber, M., and Tung, A. (2001). Spatial clustering methods in data mining: asurvey. Data Mining and Knowledge Discovery - DATAMINE.

[19] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for imagerecognition. CoRR, abs/1512.03385.

[20] Huang, G., Liu, Z., and Weinberger, K. Q. (2016). Densely connected convolutionalnetworks. CoRR, abs/1608.06993.

[21] Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W., Pazhayampallil, J., Andriluka, M.,Rajpurkar, P., Migimatsu, T., Cheng-Yue, R., Mujica, F., Coates, A., and Ng, A. Y. (2015).An empirical evaluation of deep learning on highway driving. CoRR, abs/1504.01716.

[22] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. CoRR, abs/1502.03167.

[23] Jones, E., Oliphant, T., Peterson, P., et al. (2001–). SciPy: Open source scientific toolsfor Python. [Online; accessed 25-March-2019].

[24] Karpathy, A. (2018). Stanford University CS231n: Convolutional Neural Networksfor Visual Recognition. http://cs231n.github.io/transfer-learning. [Online; accessed31-January-2019].

[25] Kaufman, L. and Rousseeuw, P. (1987). Clustering by Means of Medoids. Delft Uni-versity of Technology : reports of the Faculty of Technical Mathematics and Informatics.Faculty of Mathematics and Informatics.

[26] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR,abs/1412.6980.

[27] Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bussonnier, M., Frederic,J., Kelley, K., Hamrick, J. B., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S.,Willing, C., and et al. (2016). Jupyter notebooks - a publishing format for reproduciblecomputational workflows. In ELPUB.

[28] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tinyimages. Master’s thesis, Department of Computer Science, University of Toronto.

[29] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification withdeep convolutional neural networks. In Proceedings of the 25th International Conferenceon Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA.Curran Associates Inc.

Page 83: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

Bibliography 63

[30] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math.Statist., 22(1):79–86.

[31] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., andJackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. NeuralComput., 1(4):541–551.

[32] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learningapplied to document recognition. In Proceedings of the IEEE, pages 2278–2324.

[33] Lenz, I., Lee, H., and Saxena, A. (2013). Deep learning for detecting robotic grasps.CoRR, abs/1301.3592.

[34] Lippman, D. (2017). Math in Society. CreateSpace Independent Publishing Platform.

[35] Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-sne.Journal of Machine Learning Research, 9:2579–2605.

[36] Mitchell, T. M. (1997). Machine learning, International Edition. McGraw-Hill Seriesin Computer Science. McGraw-Hill.

[37] Moreno-Torres, J. G., Raeder, T., Alaiz-RodríGuez, R., Chawla, N. V., and Herrera, F.(2012). A unifying view on dataset shift in classification. Pattern Recogn., 45(1):521–530.

[38] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltz-mann machines. In Proceedings of the 27th International Conference on InternationalConference on Machine Learning, ICML’10, pages 807–814, USA. Omnipress.

[39] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploringgeneralization in deep learning. CoRR, abs/1706.08947.

[40] Oliphant, T. (2006–). NumPy: A guide to NumPy. USA: Trelgol Publishing. [Online;accessed 25-March-2019].

[41] Olivas, E. S., Guerrero, J. D. M., Sober, M. M., Benedito, J. R. M., and Lopez, A.J. S. (2009). Handbook Of Research On Machine Learning Applications and Trends:Algorithms, Methods and Techniques - 2 Volumes. Information Science Reference -Imprint of: IGI Publishing, Hershey, PA.

[42] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions onKnowledge and Data Engineering, 22(10):1345–1359.

[43] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison,A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS-W.

[44] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning inPython. Journal of Machine Learning Research, 12:2825–2830.

[45] Penington, J. and J.F. Dow, R. (1991). Creating artificial neural networks that generalize.Neural Networks, 4:67–79.

Page 84: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

64 Bibliography

[46] PyTorch (2019). Learning pytorch with examples - pytorch tutorials. https://pytorch.org/tutorials/beginner/pytorch_with_examples.html. [Online; accessed 25-March-2019].

[47] Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster R-CNN: towards real-timeobject detection with region proposal networks. CoRR, abs/1506.01497.

[48] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storageand organization in the brain. Psychological Review, pages 65–386.

[49] Rubner, Y., Tomasi, C., and Guibas, L. J. (1998). A metric for distributions withapplications to image databases. In Proceedings of the Sixth International Conference onComputer Vision, ICCV ’98, pages 59–, Washington, DC, USA. IEEE Computer Society.

[50] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. (2014). Imagenet large scalevisual recognition challenge. CoRR, abs/1409.0575.

[51] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

[52] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. (2014). Strivingfor simplicity: The all convolutional net. CoRR, abs/1412.6806.

[53] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.,15(1):1929–1958.

[54] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning withneural networks. CoRR, abs/1409.3215.

[55] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D.,Vanhoucke, V., and Rabinovich, A. (2014). Going deeper with convolutions. CoRR,abs/1409.4842.

[56] Vasiljevic, I., Chakrabarti, A., and Shakhnarovich, G. (2016). Examining the impact ofblur on recognition by convolutional networks. CoRR, abs/1611.05760.

[57] Warren, H. (2012). Hacker’s Delight. Pearson Education.

[58] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are featuresin deep neural networks? CoRR, abs/1411.1792.

[59] Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutionalnetworks. CoRR, abs/1311.2901.

[60] Zhou, Y., Song, S., and Cheung, N. (2017). On classification of distorted images withdeep convolutional neural networks. CoRR, abs/1701.01924.

Page 85: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

Appendix A

Code Listings

This appendix provides the code listings for all the major components of our methodology. References to these code listings can be found in Chapter 5.

A.1 Baseline Models

This section includes the definitions of the two Python modules that implement our baseline models, which were trained on undistorted images.

A.1.1 Simple Convolutional Net

import torch.nn as nn
import torch.nn.functional as F

# spatial size of the feature maps entering the first fully connected layer
# (32x32 input halved by each of the two max pooling layers)
window_size = 8


class SimpleConvNet(nn.Module):
    def __init__(self):
        super(SimpleConvNet, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32,
                               kernel_size=3, stride=1, padding=1)
        # equals the number of the previous output channels
        self.conv1_bn = nn.BatchNorm2d(num_features=32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

        self.conv2 = nn.Conv2d(in_channels=32, out_channels=16,
                               kernel_size=3, stride=1, padding=1)
        # equals the number of the previous output channels
        self.conv2_bn = nn.BatchNorm2d(num_features=16)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

        self.fc1 = nn.Linear(16 * window_size * window_size, 128)
        self.fc1_dp = nn.Dropout2d(p=0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # conv -> batch norm -> relu -> max pooling, twice
        x = self.pool1(F.relu(self.conv1_bn(self.conv1(x))))
        x = self.pool2(F.relu(self.conv2_bn(self.conv2(x))))

        # flatten the feature maps before the fully connected layers
        x = x.view(-1, 16 * window_size * window_size)
        x = F.relu(self.fc1(x))
        x = self.fc1_dp(x)
        x = self.fc2(x)

        return x

Listing 1 Simple Convolutional Net

A.1.2 All Convolutional Net

import torch
import torch.nn as nn
import torch.nn.functional as F


class AllConvNet(nn.Module):
    def __init__(self):
        super(AllConvNet, self).__init__()

        self.img_dp = nn.Dropout2d(p=0.2)

        # layers with padding of one
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=96,
                               kernel_size=3, stride=1, padding=1)
        self.conv1_bn = nn.BatchNorm2d(num_features=96)
        self.conv2 = nn.Conv2d(in_channels=96, out_channels=96,
                               kernel_size=3, stride=1, padding=1)
        self.conv2_bn = nn.BatchNorm2d(num_features=96)
        self.conv3 = nn.Conv2d(in_channels=96, out_channels=96,
                               kernel_size=3, stride=2, padding=1)
        self.conv3_bn = nn.BatchNorm2d(num_features=96)
        # dropout is applied after each layer replacing the max pooling operation
        self.mp1_dp = nn.Dropout2d(p=0.5)
        self.conv4 = nn.Conv2d(in_channels=96, out_channels=192,
                               kernel_size=3, stride=1, padding=1)
        self.conv4_bn = nn.BatchNorm2d(num_features=192)
        self.conv5 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=1, padding=1)
        self.conv5_bn = nn.BatchNorm2d(num_features=192)
        self.conv6 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=2, padding=1)
        self.conv6_bn = nn.BatchNorm2d(num_features=192)
        # dropout is applied after each layer replacing the max pooling operation
        self.mp2_dp = nn.Dropout2d(p=0.5)

        # layers with zero padding
        self.conv7 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=1, padding=0)
        self.conv7_bn = nn.BatchNorm2d(num_features=192)
        self.conv8 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=1, stride=1, padding=0)
        self.conv8_bn = nn.BatchNorm2d(num_features=192)
        self.conv9 = nn.Conv2d(in_channels=192, out_channels=100,
                               kernel_size=1, stride=1, padding=0)
        self.conv9_bn = nn.BatchNorm2d(num_features=100)

        # pooling layer -> same is done averaging over the flattened feature map
        # self.avg_pool = nn.AvgPool2d(kernel_size=6)

    def forward(self, x):
        # dropout on image input
        x = self.img_dp(x)

        # relu after batch_norm
        x = F.relu(self.conv1_bn(self.conv1(x)))
        x = F.relu(self.conv2_bn(self.conv2(x)))
        x = F.relu(self.conv3_bn(self.conv3(x)))

        # dropout on the layer that replaces max pooling
        x = self.mp1_dp(x)

        # relu after batch_norm
        x = F.relu(self.conv4_bn(self.conv4(x)))
        x = F.relu(self.conv5_bn(self.conv5(x)))
        x = F.relu(self.conv6_bn(self.conv6(x)))

        # dropout on the layer that replaces max pooling
        x = self.mp2_dp(x)

        # relu after batch_norm
        x = F.relu(self.conv7_bn(self.conv7(x)))
        x = F.relu(self.conv8_bn(self.conv8(x)))
        x = F.relu(self.conv9_bn(self.conv9(x)))

        # x here has shape (batch_size, num_features, h, w);
        # for 32x32 inputs and a batch size of 64 this is (64, 100, 6, 6)

        # global average pooling for each feature map
        # (mean over the flattened feature map dimension)
        x = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)

        # returns tensor of shape (batch_size, num_classes)
        # crossentropy loss takes care of softmax
        return x

Listing 2 All Convolutional Net
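The spatial sizes that Listings 1 and 2 depend on can be checked with the standard output-size formula for convolution and pooling layers. The sketch below assumes 32×32 inputs such as CIFAR images; `conv_out_size` is an illustrative helper, not part of the thesis code.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of a Conv2d or MaxPool2d layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Listing 1: conv(3x3, padding 1) then max pool(2x2, stride 2), twice.
size = 32
for _ in range(2):
    size = conv_out_size(size, kernel_size=3, stride=1, padding=1)
    size = conv_out_size(size, kernel_size=2, stride=2)
simple_net_window = size

# Listing 2: (kernel_size, stride, padding) of conv1 to conv9.
all_conv_layers = [(3, 1, 1), (3, 1, 1), (3, 2, 1),
                   (3, 1, 1), (3, 1, 1), (3, 2, 1),
                   (3, 1, 0), (1, 1, 0), (1, 1, 0)]
size = 32
for kernel_size, stride, padding in all_conv_layers:
    size = conv_out_size(size, kernel_size, stride, padding)
all_conv_final = size

print(simple_net_window, all_conv_final)
```

Under this assumption the first fully connected layer of the Simple Convolutional Net receives 16 × 8 × 8 values, and the All Convolutional Net ends with 6×6 feature maps, which is consistent with the commented-out `nn.AvgPool2d(kernel_size=6)` in Listing 2.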

A.2 Noise Impact Analysis

This section includes the code to analyze the impact of noise on the baseline models. First, the Earth mover's distance is computed to measure the susceptibility of convolutional filters to input data distortion. Then, the Borda count election method is used to rank filters by their susceptibility to image distortion.

A.2.1 Earth Mover’s Distance Computation

import numpy as np
from scipy.stats import wasserstein_distance


def get_layer_distance_matrix(model, trainloader, trainloader_noisy, device, layer_idx):
    '''
    Returns matrix of distances between clean and noisy feature maps
    activations for each training image, for the indicated convolutional layer

    :param model: Trained baseline model
    :param trainloader: Dataloader of clean training images
    :param trainloader_noisy: Dataloader of noisy training images
    :param device: Device object (holds info to train on GPU)
    :param layer_idx: Index of the convolutional layer whose feature maps we want to rank
    :return: 2D Numpy array of distances between clean and noisy feature maps activations
    :rtype: ndarray
    '''

    # Get feature maps activations from the baseline model.
    # get_feature_maps_activations is not defined in this listing; its output
    # is indexed below as (image, height, width, feature map).
    feature_maps_clean = get_feature_maps_activations(
        model, trainloader, device, layer_idx=layer_idx)
    feature_maps_dirty = get_feature_maps_activations(
        model, trainloader_noisy, device, layer_idx=layer_idx)

    # img_idx indexes images (samples)
    # feat_map_idx indexes the feature maps of the given layer
    # Imgs on rows, feature maps on columns
    # Each value is the Wasserstein distance between the clean and dirty
    # versions of the same feature map for that image
    distance_mat = np.array(
        [
            [
                wasserstein_distance(
                    feature_maps_clean[img_idx, :, :, feat_map_idx].flatten(),
                    feature_maps_dirty[img_idx, :, :, feat_map_idx].flatten())
                for feat_map_idx in range(feature_maps_clean.shape[3])
            ]
            for img_idx in range(feature_maps_clean.shape[0])
        ]
    )

    return distance_mat

Listing 3 Earth Mover’s Distance Computation
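For two samples of equal size with uniform weights, the one-dimensional Earth mover's distance reduces to the mean absolute difference between the sorted values. The sketch below uses that special case to illustrate what each entry of the distance matrix measures. `emd_1d` is an illustrative helper rather than part of the thesis code, and Listing 3 itself relies on SciPy's `wasserstein_distance`.

```python
def emd_1d(u_values, v_values):
    """1-D Earth mover's distance between two equally sized samples."""
    if len(u_values) != len(v_values) or not u_values:
        raise ValueError("samples must be non-empty and of equal size")
    # Optimal transport in one dimension pairs the sorted values in order.
    pairs = zip(sorted(u_values), sorted(v_values))
    return sum(abs(u - v) for u, v in pairs) / len(u_values)


# Flattened activations of one feature map (toy values).
clean = [0.0, 1.0, 3.0]
shifted = [5.0, 6.0, 8.0]
print(emd_1d(clean, shifted))   # every unit of mass moves by 5
print(emd_1d(clean, [3.0, 0.0, 1.0]))  # same values, different positions
```

The second call returns zero because the distance compares the distributions of activation values and ignores where in the feature map each value occurs.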

A.2.2 Ranking Convolutional Filters per Layer with Borda Count

import numpy as np
from collections import Counter


def get_most_changing_feat_maps(layer_distance_matrix, borda_n=10):
    '''
    Returns ranking of the indices of the feature maps that change the most,
    according to Borda count

    :param ndarray layer_distance_matrix: Rows are images, columns are EMDs between
        clean and noisy activations for each feature map
    :param int borda_n: Number of points given to the first candidate in the Borda count
    :return: Feature map indices that would sort the array descendingly by their Borda counts
    :rtype: ndarray
    '''

    # matrix of indices N x borda_n: for each image, the borda_n feature maps
    # with the largest distance, most affected first
    arg_sorted_dist_mat = np.argsort(-layer_distance_matrix, axis=1)[:, :borda_n]

    assert (borda_n == arg_sorted_dist_mat.shape[1]), \
        "borda_n does not equal the number of columns of the argsorted matrix"

    # list of borda_n Counter objects, ordered by the points they should
    # receive according to the Borda count method:
    # elements at position [0] in the list receive borda_n points, and so on
    counters_list = [Counter(arg_sorted_dist_mat[:, column]) for column in range(borda_n)]

    # scale the counter objects by their points based on the position
    points = borda_n
    for counter_obj in counters_list:
        for key in counter_obj:
            counter_obj[key] *= points
        points -= 1

    # create a Counter object as the sum of all Counter objects
    sum_counter = sum(counters_list, Counter())

    # array to hold all indices of the feature maps
    feature_maps_indices = np.arange(layer_distance_matrix.shape[1])
    # for each feature map, get its Borda count
    feature_maps_borda_counts = np.array(
        [sum_counter[feat_map_index] for feat_map_index in feature_maps_indices])

    assert (len(feature_maps_borda_counts) == layer_distance_matrix.shape[1]), \
        "feature_maps_borda_counts is not as big as the number of feature maps!"

    return feature_maps_borda_counts.argsort()[::-1]

Listing 4 Ranking Convolutional Filters per Layer with Borda Count
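A small worked example may help to follow the point assignment. The sketch below is a plain-Python re-statement of the same Borda count on a toy distance matrix; `borda_ranking` and the distance values are illustrative assumptions, and ties between feature maps may be broken differently than in Listing 4.

```python
def borda_ranking(distance_rows, borda_n):
    """Rank feature maps by Borda count, most affected first.

    distance_rows[i][j] is the distance for image i and feature map j.
    In each row the map with the largest distance gets borda_n points,
    the next one borda_n - 1, and so on; maps outside the top borda_n
    get no points for that image.
    """
    num_maps = len(distance_rows[0])
    scores = [0] * num_maps
    for row in distance_rows:
        by_distance = sorted(range(num_maps), key=lambda j: -row[j])
        for position, j in enumerate(by_distance[:borda_n]):
            scores[j] += borda_n - position
    ranking = sorted(range(num_maps), key=lambda j: -scores[j])
    return ranking, scores


# Three images (rows), three feature maps (columns), toy distances.
toy_distances = [
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.3, 0.4],
]
ranking, scores = borda_ranking(toy_distances, borda_n=2)
print(scores)   # points collected by feature maps 0, 1, 2
print(ranking)  # feature map indices, most affected first
```

Feature map 1 wins two of the three "elections" and collects 4 points, feature map 2 is second in every image and collects 3, and feature map 0 collects 2, so the ranking is 1, 2, 0.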

A.3 Models for Selective Filter-Level Fine-Tuning

This section, along the lines of the Python modules defined in Section A.1, illustrates how the models that enable selective filter-level fine-tuning are implemented. Both models make use of the reorder_filters function, which reorders the convolutional filters of each fine-tuned convolutional layer to properly merge the activation maps of the layer after they have been split for training purposes. The function's implementation is shown here, followed by the two model definitions:

import torch


def reorder_filters(x, indices):
    '''
    Puts the retrained filters at the correct index after merging the layers
    that were split for fine-tuning purposes

    :param Tensor x: Torch Tensor containing the activation maps to be reordered
    :param ndarray indices: ndarray of len=n containing the indices of the last n
        activation maps, before being split to fine-tune the corresponding filters
    :return: Torch Tensor of activation maps stacked in the correct order
    :rtype: Tensor
    '''
    filters_count = len(indices)

    # activation maps of frozen layer
    frozen = x[:, :-filters_count, ...]
    # activation maps of retrained layer
    retrained = x[:, -filters_count:, ...]

    # i is the index of the element in the indices list
    # idx is the original index of the activation map
    # this loop places retrained[i] in position idx of the frozen array
    for i, idx in enumerate(indices):
        before = frozen[:, :idx, ...]
        after = frozen[:, idx:, ...]
        item_to_insert = retrained[:, i, ...].unsqueeze(1)
        frozen = torch.cat((before, item_to_insert, after), dim=1)
    return frozen

Listing 5 Merging Activation Maps after the Split
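The insertion logic of Listing 5 can be traced on plain Python lists, with channel labels standing in for activation maps. The sketch below is illustrative only: `split_channels` and `merge_channels` are hypothetical helpers, and the split step is an assumption about how the frozen and retrained groups are formed. The loop appears to rely on the indices being in ascending order, since a later insertion at a lower position would shift the channels inserted before it.

```python
def split_channels(channels, retrain_indices):
    """Move the channels at retrain_indices to the end, keeping their order."""
    retrain = sorted(retrain_indices)
    chosen = set(retrain)
    frozen = [c for i, c in enumerate(channels) if i not in chosen]
    retrained = [channels[i] for i in retrain]
    return frozen + retrained, retrain


def merge_channels(split, retrain_indices):
    """List analogue of reorder_filters: reinsert the last n channels."""
    n = len(retrain_indices)
    frozen, retrained = split[:len(split) - n], split[len(split) - n:]
    for i, idx in enumerate(retrain_indices):
        frozen = frozen[:idx] + [retrained[i]] + frozen[idx:]
    return frozen


original = ["c0", "c1", "c2", "c3", "c4", "c5"]
split, indices = split_channels(original, retrain_indices=[4, 1])
print(split)                           # frozen channels first, retrained last
print(merge_channels(split, indices))  # back in the original order
```

The round trip restores the original channel order, which is what allows the batch normalization layer that follows to keep its per-channel statistics aligned with the baseline model.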

A.3.1 Fine-Tunable Simple Convolutional Net

1 import torch2 import torch.nn as nn3

4 class SimpleConvNetRetrainable(nn.Module):5 def __init__(self, filters_to_retrain):6 super(SimpleConvNetRetrainable, self).__init__()7

Page 94: Optimization of Convolutional Neural Networks: Transfer ... · Optimization of Convolutional Neural Networks: Transfer Learning for Robustness to Image Distortion through Selective

74 Code Listings

        # dictionary with filters to retrain per layer
        self.filters_to_retrain = filters_to_retrain

        # _f = frozen ; _r = retrainable
        self.conv1_f = nn.Conv2d(in_channels=3,
                                 out_channels=32 - len(filters_to_retrain['conv1']),
                                 kernel_size=3, padding=1)
        self.conv1_r = nn.Conv2d(in_channels=3,
                                 out_channels=len(filters_to_retrain['conv1']),
                                 kernel_size=3, padding=1)

        self.conv1_bn = nn.BatchNorm2d(num_features=32)  # equals the number of the previous output channels
        self.pool1 = nn.MaxPool2d(kernel_size=2)

        self.conv2_f = nn.Conv2d(in_channels=32,
                                 out_channels=16 - len(filters_to_retrain['conv2']),
                                 kernel_size=3, padding=1)
        self.conv2_r = nn.Conv2d(in_channels=32,
                                 out_channels=len(filters_to_retrain['conv2']),
                                 kernel_size=3, padding=1)

        self.conv2_bn = nn.BatchNorm2d(num_features=16)  # equals the number of the previous output channels
        self.pool2 = nn.MaxPool2d(kernel_size=2)

        self.fc1 = nn.Linear(16 * 8 * 8, 128)  # window_size = 8
        self.fc1_dp = nn.Dropout2d(p=0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):

        x_froz = self.conv1_f(x)
        x_retr = self.conv1_r(x)
        x = torch.cat((x_froz, x_retr), dim=1)  # merges the two outputs


        x = reorder_filters(x, self.filters_to_retrain['conv1'])
        x = self.pool1(F.relu(self.conv1_bn(x)))

        x_froz = self.conv2_f(x)
        x_retr = self.conv2_r(x)
        x = torch.cat((x_froz, x_retr), dim=1)
        x = reorder_filters(x, self.filters_to_retrain['conv2'])
        x = self.pool2(F.relu(self.conv2_bn(x)))

        x = x.view(-1, 16 * 8 * 8)  # window_size = 8
        x = F.relu(self.fc1(x))
        x = self.fc1_dp(x)
        x = self.fc2(x)

        return x

Listing 6 Fine-Tunable Simple Convolutional Net
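Listing 6 only defines the split layers; it does not show how they are initialised from a trained network. One possible way, sketched below under stated assumptions, is to copy the selected filters of a pretrained `nn.Conv2d` into the `_r` layer, the remaining ones into the `_f` layer, and disable gradients for the latter. The helper `split_pretrained_conv` and the variable names are hypothetical, not taken from the thesis, and the pretrained layer is assumed to have a bias term.

```python
import torch
import torch.nn as nn


def split_pretrained_conv(conv, retrain_idx):
    """Split `conv` into (frozen, retrainable) layers (hypothetical helper).

    Assumes `conv` has a bias and that `retrain_idx` is a non-empty proper
    subset of its output channels.
    """
    retrain_idx = sorted(retrain_idx)
    frozen_idx = [i for i in range(conv.out_channels) if i not in retrain_idx]

    def make(idx):
        layer = nn.Conv2d(conv.in_channels, len(idx),
                          kernel_size=conv.kernel_size,
                          stride=conv.stride, padding=conv.padding)
        with torch.no_grad():
            # copy the selected filters and their biases
            layer.weight.copy_(conv.weight[idx])
            layer.bias.copy_(conv.bias[idx])
        return layer

    conv_f, conv_r = make(frozen_idx), make(retrain_idx)
    for p in conv_f.parameters():
        p.requires_grad_(False)  # frozen filters receive no gradient
    return conv_f, conv_r


torch.manual_seed(0)
pretrained = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
conv_f, conv_r = split_pretrained_conv(pretrained, [2, 5])

x = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    full = pretrained(x)
    part_f = conv_f(x)
    part_r = conv_r(x)
```

Under these assumptions `part_r` matches channels 2 and 5 of `full`, and an optimizer would be given only the parameters whose `requires_grad` flag is still set.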

A.3.2 Fine-Tunable All Convolutional Net

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in step() and forward()


class AllConvNetRetrainable(nn.Module):
    def __init__(self, filters_to_retrain):
        super(AllConvNetRetrainable, self).__init__()

        # dictionary with filters to retrain per layer
        self.filters_to_retrain = filters_to_retrain

        self.img_dp = nn.Dropout2d(p=0.2)

        # _f = frozen ; _r = retrainable
        # padding one layers
        self.conv1_f = nn.Conv2d(in_channels=3,
                                 out_channels=96 - len(filters_to_retrain['conv1']),
                                 kernel_size=3, stride=1, padding=1)


        self.conv1_r = nn.Conv2d(in_channels=3,
                                 out_channels=len(filters_to_retrain['conv1']),
                                 kernel_size=3, stride=1, padding=1)
        self.conv1_bn = nn.BatchNorm2d(num_features=96)  # equals the number of the previous output channels

        # padding one layers
        self.conv2_f = nn.Conv2d(in_channels=96,
                                 out_channels=96 - len(filters_to_retrain['conv2']),
                                 kernel_size=3, stride=1, padding=1)
        self.conv2_r = nn.Conv2d(in_channels=96,
                                 out_channels=len(filters_to_retrain['conv2']),
                                 kernel_size=3, stride=1, padding=1)
        self.conv2_bn = nn.BatchNorm2d(num_features=96)  # equals the number of the previous output channels

        # padding one layers
        self.conv3_f = nn.Conv2d(in_channels=96,
                                 out_channels=96 - len(filters_to_retrain['conv3']),
                                 kernel_size=3, stride=2, padding=1)
        self.conv3_r = nn.Conv2d(in_channels=96,
                                 out_channels=len(filters_to_retrain['conv3']),
                                 kernel_size=3, stride=2, padding=1)
        self.conv3_bn = nn.BatchNorm2d(num_features=96)  # equals the number of the previous output channels

        self.mp1_dp = nn.Dropout2d(p=0.5)

        # padding one layers
        self.conv4 = nn.Conv2d(in_channels=96, out_channels=192,
                               kernel_size=3, stride=1, padding=1)
        self.conv4_bn = nn.BatchNorm2d(num_features=192)


        self.conv5 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=1, padding=1)
        self.conv5_bn = nn.BatchNorm2d(num_features=192)
        self.conv6 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=2, padding=1)
        self.conv6_bn = nn.BatchNorm2d(num_features=192)

        self.mp2_dp = nn.Dropout2d(p=0.5)

        # padding zero layers
        self.conv7 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=3, stride=1, padding=0)
        self.conv7_bn = nn.BatchNorm2d(num_features=192)
        self.conv8 = nn.Conv2d(in_channels=192, out_channels=192,
                               kernel_size=1, stride=1, padding=0)
        self.conv8_bn = nn.BatchNorm2d(num_features=192)
        self.conv9 = nn.Conv2d(in_channels=192, out_channels=100,
                               kernel_size=1, stride=1, padding=0)
        self.conv9_bn = nn.BatchNorm2d(num_features=100)

    def step(self, x, conv_f, conv_r, bn, layer_name):
        '''Forward pass on the fine-tunable layers'''
        x_f = conv_f(x)
        x_r = conv_r(x)
        x = torch.cat((x_f, x_r), dim=1)
        x = reorder_filters(x, self.filters_to_retrain[layer_name])
        x = F.relu(bn(x))
        return x

    def forward(self, x):
        x = self.img_dp(x)

        # forward pass on fine-tunable layers
        x = self.step(x, self.conv1_f, self.conv1_r, self.conv1_bn, 'conv1')


        x = self.step(x, self.conv2_f, self.conv2_r, self.conv2_bn, 'conv2')
        x = self.step(x, self.conv3_f, self.conv3_r, self.conv3_bn, 'conv3')

        x = self.mp1_dp(x)

        x = F.relu(self.conv4_bn(self.conv4(x)))
        x = F.relu(self.conv5_bn(self.conv5(x)))
        x = F.relu(self.conv6_bn(self.conv6(x)))

        x = self.mp2_dp(x)

        x = F.relu(self.conv7_bn(self.conv7(x)))
        x = F.relu(self.conv8_bn(self.conv8(x)))
        x = F.relu(self.conv9_bn(self.conv9(x)))

        # x here has shape (batch_size, num_features, w, h)

        # global average pooling for each feature map (mean over the flattened feature map dimension)
        x = torch.mean(x.view(x.size(0), x.size(1), -1), dim=2)

        # returns tensor of shape (batch_size, num_classes)
        # crossentropy loss takes care of softmax
        return x

Listing 7 Fine-Tunable All Convolutional Net
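To see what the global average pooling at the end of Listing 7 operates on, the spatial size of the feature maps can be traced through the nine convolutions with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is only illustrative: the kernel, stride and padding values are read from the listing, while the 32x32 input size is an assumption (it is the size implied by `window_size = 8` in Listing 6) and `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel_size, stride, padding) as given in Listing 7
layers = [
    ("conv1", 3, 1, 1), ("conv2", 3, 1, 1), ("conv3", 3, 2, 1),
    ("conv4", 3, 1, 1), ("conv5", 3, 1, 1), ("conv6", 3, 2, 1),
    ("conv7", 3, 1, 0), ("conv8", 1, 1, 0), ("conv9", 1, 1, 0),
]

size = 32  # assumed input width and height
trace = []
for name, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    trace.append((name, size))

for name, out_size in trace:
    print(name, out_size)
```

Under this assumption the last layer produces 100 feature maps of 6x6 values each, and averaging every map yields the (batch_size, 100) tensor returned by `forward`.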

A.4 K-Medoids Clustering

We conclude the code listings with the function for the k-medoids clustering algorithm, which is based on the implementation provided in [2].

import numpy as np


def k_medoids_clustering(distance_mat, k, tmax=100):
    '''
    Apply k-medoids clustering algorithm

    :param ndarray distance_mat: NxN distance matrix (pairwise distances)
    :param int k: Number of clusters to be identified
    :param int tmax: Maximum number of iterations to be performed
    :return ndarray M: Array of indices of the medoids
    :return dict C: Python dictionary (key = label | value = indices of the datapoints that have that label)
    :rtype: tuple
    '''

    # determine dimensions of distance matrix
    m, n = distance_mat.shape

    if k > n:
        raise Exception('Error: Too many medoids (k>n)')

    # find a set of valid initial cluster medoid indices since we
    # can't seed different clusters with two points at the same location
    valid_medoid_inds = set(range(n))
    invalid_medoid_inds = set([])
    rs, cs = np.where(distance_mat == 0)
    # the rows, cols must be shuffled because we will keep the first duplicate below
    index_shuf = list(range(len(rs)))
    np.random.shuffle(index_shuf)
    rs = rs[index_shuf]
    cs = cs[index_shuf]
    for r, c in zip(rs, cs):
        # if there are two points with a distance of 0...
        # keep the first one for cluster init
        if r < c and r not in invalid_medoid_inds:


            invalid_medoid_inds.add(c)
    valid_medoid_inds = list(valid_medoid_inds - invalid_medoid_inds)

    if k > len(valid_medoid_inds):
        raise Exception('Error: Too many medoids (after removing {} duplicate points)'.format(
            len(invalid_medoid_inds)))

    # randomly initialize an array of k medoid indices
    M = np.array(valid_medoid_inds)
    np.random.shuffle(M)
    M = np.sort(M[:k])

    # create a copy of the array of medoid indices
    Mnew = np.copy(M)

    # initialize a dictionary to represent clusters
    C = {}
    for t in range(tmax):
        # determine clusters, i.e. arrays of distance_mat indices
        J = np.argmin(distance_mat[:, M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J == kappa)[0]
        # update cluster medoids
        for kappa in range(k):
            J = np.mean(distance_mat[np.ix_(C[kappa], C[kappa])], axis=1)
            j = np.argmin(J)
            Mnew[kappa] = C[kappa][j]
        # np.sort returns a sorted copy, so the result is assigned back
        Mnew = np.sort(Mnew)
        # check for convergence
        if np.array_equal(M, Mnew):
            break
        M = np.copy(Mnew)
    else:
        # final update of cluster memberships


        J = np.argmin(distance_mat[:, M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J == kappa)[0]

    return M, C

Listing 8 K-Medoids Clustering
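The two alternating steps of Listing 8 (assign every point to its nearest medoid, then move each medoid to the cluster member with the smallest mean distance to the others) can be followed by hand on a tiny example. The sketch below is not the thesis function: `pairwise_distances` and `k_medoids_step` are hypothetical helpers, the six one-dimensional points are made up, and the initial medoids are fixed rather than drawn at random so that the run is reproducible.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean NxN distance matrix for an (N, d) array of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def k_medoids_step(distance_mat, medoids):
    """One assignment step followed by one medoid update (illustrative)."""
    # label of each point = position of its nearest medoid
    labels = np.argmin(distance_mat[:, medoids], axis=1)
    new_medoids = medoids.copy()
    for kappa in range(len(medoids)):
        members = np.where(labels == kappa)[0]
        # mean distance of every member to the members of its cluster
        mean_dist = distance_mat[np.ix_(members, members)].mean(axis=1)
        new_medoids[kappa] = members[np.argmin(mean_dist)]
    return labels, new_medoids


# two well separated groups on a line (made-up data)
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
D = pairwise_distances(points)

medoids = np.array([0, 3])  # fixed start instead of a random one
labels, medoids_1 = k_medoids_step(D, medoids)
_, medoids_2 = k_medoids_step(D, medoids_1)
converged = np.array_equal(medoids_1, medoids_2)
```

Here the first step moves the medoids from the points at indices 0 and 3 to the central points at indices 1 and 4, and the second step leaves them unchanged, which is the convergence test used in the listing.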
