
Metadata Validation Using a Convolutional Neural Network

Detection and Prediction of Fashion Products

Henrik Nilsson Harnert

Computer Science and Engineering, master's level

2019

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering

Abstract

In the e-commerce industry, importing data from third-party clothing brands requires validation of this data. If the validation step is done manually, it is a tedious and time-consuming task. Part of this task can be replaced or assisted by using computer vision to automatically find clothing types, such as T-shirts and pants, within imported images. After a clothing type has been detected, it is possible to recommend, with a certain accuracy, which clothing products are likely to correlate with the imported data. This was done alongside a prototype interface that can be used to start training, find clothing types in an image, and mask annotations of products. Annotations are areas describing different clothing types and are used to train an object detector model.

A model for finding clothing types is trained with the Mask R-CNN object detector and achieves 0.49 mAP. A detection takes just above one second on an Nvidia GTX 1070 8 GB graphics card.

Recommending one or several products based on a detection takes 0.5 seconds, and the algorithm used is k-nearest neighbors. If prediction is done on the products that were used to build the model of the prediction algorithm, almost perfect accuracy is achieved, while products appearing in images of other products do not achieve nearly as good results.

Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Scope
  1.4 Acronyms and Abbreviations
  1.5 Thesis Structure

2 Related Work
  2.1 Image Detection
  2.2 Performance
  2.3 Datasets
    2.3.1 Common Object in Context
    2.3.2 Modanet
    2.3.3 Subset of Total Spotin Dataset

3 Theory
  3.1 Artificial Neural Network
  3.2 Convolutional Neural Network
  3.3 Overfitting and Underfitting
  3.4 Limited Dataset
    3.4.1 Transfer Learning
    3.4.2 Data Augmentation
  3.5 Mean Average Precision
  3.6 Object Detection Methods
    3.6.1 Machine Learning
    3.6.2 Deep Learning
      3.6.2.1 Region Based Detectors
      3.6.2.2 Single shot detectors
      3.6.2.3 Comparing the numbers
  3.7 K-Nearest Neighbors

4 Implementation
  4.1 Back End
    4.1.1 Prediction
  4.2 Node
    4.2.1 Training
      4.2.1.1 Data Augmentation
    4.2.2 Detection
  4.3 Front End
    4.3.1 Canvas Library
    4.3.2 Detection
    4.3.3 Annotations
    4.3.4 Train
  4.4 Dataset

5 Evaluation
  5.1 Training
    5.1.1 Modanet Dataset
    5.1.2 Spotin Dataset
    5.1.3 Timing Performance
  5.2 Detection
    5.2.1 Timing Performance
  5.3 Prediction
    5.3.1 Timing Performance
  5.4 Front End

6 Discussion
  6.1 Training
    6.1.1 Modanet Dataset
    6.1.2 Spotin Dataset
      6.1.2.1 Transfer Learning with COCO
      6.1.2.2 Transfer Learning with Modanet
      6.1.2.3 Complete Spotin Dataset
      6.1.2.4 Overfitting
    6.1.3 Comparing Result to Other’s
  6.2 Detection
    6.2.1 Defining Users Expectations
    6.2.2 Timing of Detection
    6.2.3 Comparing Result to Other’s
  6.3 Prediction
    6.3.1 Timing of Prediction
    6.3.2 Comparing Result to Other’s
  6.4 Implementation
  6.5 Video Possibilities
  6.6 Ethical
  6.7 Encountered Problems

7 Conclusion

8 Future Work

References

Appendices
  A Back End API
  B Translation of Category Ids from Modanet’s to Spotin’s
  C Training on Modanet Dataset With "all" Layer, COCO Weight for TL
  D Training on Modanet Dataset With "heads" Layer, COCO Weight for TL
  E Training on Spotin Dataset With "all" Layers, COCO and Modanet #94 for TL
  F Training on Spotin Dataset With "all" Layers, COCO and Modanet #94 for TL
  G Training on Spotin Dataset With "all" Layers, COCO and Modanet #94 for TL
  H Timing Performance of Detection on Different Hardware
  I Prediction With Uniform as Weight Scale
  J Prediction With Distance as Weight Scale
  K Timing Performance of Prediction on Different Hardware

1 Introduction

In our everyday tasks we subconsciously process data without thinking about it. For us humans it comes naturally, but it can be a real time sink if the gathered data has to be projected onto another medium. Within the e-commerce industry for fashion, a common task is to iterate over all new products and make sure that metadata such as header, category and price is correct. This process can be fairly fast if only one product at a time has to be processed, as most of the information fits within one view without the need to scroll through lists and change windows. But if a product can relate to other products, analyzing its metadata can get very time-consuming without efficient tools.

1.1 Background

Figure 1: Spots placed on third party products in Spotin’s catalogue. Screenshot taken from Spotin’s demo page [1].

Spotin [2] is developing a sales platform for distributing sales via third-party content, as well as for digital marketing. The goal is to simplify the process of purchasing products online and to give the brands control of how they are exposed. By importing the brands' product data and displaying that information within third-party content, Spotin has to handle a large amount of data and make sure it is displayed in a way that satisfies both the brand and the content makers. On the market there are several e-commerce platforms, such as WooCommerce, Magento and Shopify, that Spotin has to integrate with. When retrieving product data, an important step in the import process is to validate that the data is imported correctly.

One way Spotin displays these retrieved products is through so-called spots, as seen in Fig. 1. When retrieving product data from the different brands there is no data for the placement of these spots; therefore this has to be done manually and is one of the more time-consuming tasks. These spots are interactable elements that are used to display further information about the products. Most of the time these images contain other products as well, which need to have spots placed on the region of the image that represents the correlating product. There are currently filters that help speed up finding these correlating products, but this process still takes considerable time scrolling through lists of thousands of products.

Computer vision, CV, will be used to speed up this task of finding and suggesting possible products within images of fashion products without the need for filtering. By annotating a dataset with masks correlating to the area of different clothing types within a product's images, it is possible to create and train a model that will be able to find masks of clothing types. The model learns by comparing the result to a ground-truth, which is the mask that was manually annotated. Finding clothing types within images is hereafter called detect/to detect. Another algorithm, called k-nearest neighbors, K-NN for short, will be used to suggest likely products based on a mask found by the CV model. Identifying these likely clothing products based on a detection is hereafter called predict/to predict.

1.2 Problem Definition

The following questions will be answered:

PD.1 With a dataset containing only a few hundred samples, what accuracy of detection is possible to achieve?

PD.2 Based on masks from category detection, is it possible to identify which unique product was found?

PD.3 Is it possible to get detections and predictions within a few seconds?

PD.4 What is the complexity to use and implement the solution?

PD.5 What are the possibilities to use this implementation on video?


1.3 Scope

In order to reduce the scope of this master's thesis, a few limitations had to be set:

Firstly, product images with only one person will be chosen for training and detection. This is to not confuse the algorithm when handling some clothing types, for example jackets and shoes, which often have multiple masks in an image because the jacket is open and the model's feet are separated.

Only one dataset other than Spotin's own is used for training, in order to further reduce the scope and be able to focus on other parts of the implementation. Furthermore, only part of Spotin's catalogue has been annotated.

Not all clothing categories in Spotin's catalogue were annotated. The most common clothing types that are present in all datasets will be chosen. Categories that are similar will also be chosen in order to see if the model can differentiate between them. In total, eight clothing categories will be chosen for annotation in order to limit the amount of annotation and focus on training.

For all machine learning the training data is key to a good result. Annotating masks of clothing types in images takes considerable time, as creating the polygon data describing the ground-truth of a clothing type has to be done with high precision. A balance between implementing and annotating products had to be struck during the implementation phase.


1.4 Acronyms and Abbreviations

ANN         Artificial Neural Network
API         Application Programming Interface
CNN         Convolutional Neural Network
COCO        Common Object in Context
CV          Computer Vision
DL          Deep Learning
FastR-CNN   Fast Region-based CNN
FR-CNN      Faster Region-based CNN
K-NN        K-Nearest Neighbors
mAP         Mean Average Precision
MR-CNN      Mask Region-based CNN
R-CNN       Region-based CNN
REST        REpresentational State Transfer
RWS         RESTful Web Service
SSD         Single Shot Detector
TL          Transfer Learning
YOLO        You Only Look Once

1.5 Thesis Structure

In the next chapter, work that has tried to solve a similar problem will be discussed, and what may or may not have worked will form the base for the implementation of this thesis. The theory and the different models for image detection that were considered will be compared in the Theory chapter. In the Implementation chapter the choices made during implementation and the visual result will be displayed. In the Evaluation chapter different parameters will be tested, and in the Discussion they will be evaluated. Finally, this thesis will finish with a conclusion and what needs to be done in the future.

2 Related Work

Detection and recommendation of fashion products from images is nothing new and has been visited and revisited numerous times, each time improving in accuracy and execution time. But as machine learning is still such an evolving field, each year new approaches to existing problems are proposed. Many of the related works that were found will be reviewed for their solutions to detecting clothing types and predicting possible similar products.

2.1 Image Detection

Most similar to the sought use case is Deep Fashion for Fast and Accurate Fashion Item Detection [3]. The goal was to detect clothes with the help of a CNN faster than currently available methods while keeping accuracy. The model they used was FastR-CNN, and they replaced the region proposal generation, which at the time was Selective Search [4] for the R-CNN models, with MultiBox. They managed to reduce computation time close to tenfold [3]: the original method took 2600 milliseconds to compute, while replacing Selective Search with MultiBox reduced compute time to 310 milliseconds while still obtaining high accuracy [3].

Another approach to a similar problem is that of Image Based Fashion Product Recommendation with Deep Learning [5]. Their goal is slightly different in that they wanted to extract recommendations based on patterns and textures of fashion products. As a backbone for the CNN they compared AlexNet and BN-inception for detection of features, and for predictions K-NN was used with these features as input. The backbone is the structure of the layers in the network.

A deep learning pipeline for product recognition on store shelves [6] has a different use case but a similar execution to what has been seen so far. They used You Only Look Once [7], YOLO, for image detection.

2.2 Performance

On top of being able to deliver a model with high accuracy, people expect to get the results delivered as fast as possible. A survey [8] done by Limelight Networks concluded that almost half of people are not willing to wait more than five seconds. Of the participants asked, 19.2% are willing to wait three seconds for a website to load. Only one-third said that they will stay on a slow-loading website. In contrast, 43.5% said that they would rather abandon the website in order to seek out a competitor's website.

2.3 Datasets

In machine learning the models need something to learn from in order to improve. This is why datasets are gathered that represent what the wanted result might look like. By analyzing these datasets the model can learn and, for each training iteration, get better at finding features that are unique to the classes of the dataset. Classes in a fashion dataset can be T-shirt or pants. In this case the dataset consists of images, but for other use cases, for example stock prediction, the dataset is the history of stocks, and for suggesting the ending of a sentence the dataset can be based on texts by William Shakespeare.

The datasets chosen for this thesis consist of images and are related to object detection and mainly fashion. A more general dataset with many kinds of objects is chosen as a foundation for all training, and a more fashion-related dataset will be used to test whether better results on the dataset that Spotin provides can be achieved by first training on another fashion dataset.

2.3.1 Common Object in Context

Common Object in Context [9]¹, COCO, is a widely used dataset for testing CV models to get an estimate of how they compare to other models trying to solve the same problem. Over the years there have been several iterations of and improvements to this dataset, which is why it is so widely used. The dataset consists of images and, for each image, annotations describing the different objects in the image, such as whether a car or a pedestrian is located within the image. In total the dataset consists of 80 different object categories.

Because the dataset is used to such an extent, a lot of training has been done on it, which is why there exist so-called weight files based on the training done. These weight files store the state that the model has arrived at after training and validation are completed.

Instead of training the model from scratch each time, these weight files can be used to restore an old state and continue the training. This method is called transfer learning [10] and can be useful when dealing with small datasets. Extracting features and finding unique patterns within a dataset is a similar task for all datasets within CV. Therefore, it is possible to use the weight file produced from training a model on COCO, which contains general objects such as cars and pedestrians, when training on datasets with fashion products.

¹ COCO is under the Creative Commons Attribution 4.0 License. Only a weight file based on the dataset is used, and it cannot be traced back to the actual dataset.


2.3.2 Modanet

Based on the Paperdoll [11] dataset of street fashion images, Modanet [12]² aims to fix an issue with low-quality pixel annotation of clothing types. The problem with the Paperdoll dataset is that annotations are often missing for images and are instead marked as an unknown category [11]. This prevents non-labeled clothing types from being identified as background, which is a fix for a problem but not the optimal solution. Modanet addresses this by re-annotating 55 176 images [12] out of the 339 797 images [11] in the Paperdoll dataset. The images of the Modanet dataset have high-quality pixel and polygon annotations divided into 13 clothing categories [12] and no unknown category.

2.3.3 Subset of Total Spotin Dataset

The catalogue that Spotin has includes multiple brands, each brand having multiple collections. The collection that was used as the dataset during this thesis was Whyred's collection. This collection includes 360 products with an average of four images per product.

Spotin currently has a total of 70 categories, and with time this number will increase. When manually annotating the Spotin dataset, eight categories were chosen in order to reduce the scope. In total 72 products were annotated, roughly evenly distributed over these eight clothing types. This resulted in 269 annotated images, often containing multiple types of clothes. A total of 431 annotations were made in these images.

Table 1: List of annotated clothing types and their distribution

Clothing type    Category id    Products annotated    Annotated images
Short dresses    12             8                     35
Shirts           9              8                     43
Pants            86             9                     40
Long dresses     34             8                     29
T-shirts         81             12                    24
Jackets          77             9                     49
Knitwear         93             9                     22
Jeans            73             9                     27

² Modanet's license is Creative Commons Attribution-NonCommercial 4.0 International Public License and therefore Spotin will not be able to use their dataset after this thesis. Weight files generated based on a dataset cannot be traced back to the dataset and can therefore be used freely.

3 Theory

Machine Learning, ML, has exploded in popularity in recent years. One of the branches of ML, Artificial Neural Networks [13], ANN, has received more research than other branches within ML due to its recent success in solving real-world problems. Until the recent boom in research, the field was limited by a lack of compute power. The recent availability of cheaper hardware and the possibility of using much more powerful graphics cards for data processing have made the field more accessible for researchers who want to apply ANNs to real-world problems. As the field has exploded, the amount of research that has been done has led to increased accuracy of the models and reduced training and computation time.

These networks are inspired by how the human brain works. The cerebral cortex is the outer layer of the brain. It contains a large number of nerve cells, neurons, that are linked together with nerve strands, axons. At the end of these axons there are synapses that connect to other neurons, thus forming a network of neurons [13].

3.1 Artificial Neural Network

An ANN is structured in layers, and each layer consists of neurons that have inputs and outputs. A function is applied to the input fed into the neuron. The transformed input is the result of the neuron and becomes its output. By connecting multiple such neurons in parallel to form a layer, with each layer feeding data to the next, a network is formed. These networks are called Deep Learning, DL, networks and an example can be seen in Fig. 2.

A DL network has to have at least three layers:

1. One layer visible to the model where input data is sent.

2. n hidden layers in the middle.

3. One final layer for output of data to the model.
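The three-layer structure above can be illustrated with a minimal sketch of a forward pass through a fully connected network. The weight values, the bias terms and the choice of the sigmoid as the neuron function are assumptions made purely for illustration; they are not taken from the thesis implementation.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """Fully connected layer: every neuron receives every input.

    weights[i] holds the weights of neuron i, one per input value.
    The bias term is an illustrative assumption.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs


def forward(inputs, layers):
    """Feed the data through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.5, 0.2]], [0.05])
result = forward([1.0, 2.0], [hidden, output])
```

Here `result` is a list with one value between 0 and 1, the output of the single neuron in the final layer.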

There are two ways neurons can be connected in the network. In a fully connected network, all neurons in the previous layer connect to the neuron and its output is connected to all neurons in the next layer. Fig. 2 is an example of a fully connected neural network, as all neurons have a connection to all neurons in the previous and next layer. Another possible way of connecting is receptive fields, where the neuron is only fed some inputs from the previous layer and/or connects its output to only some neurons in the next layer. This method is often used to extract features from images. This can be seen in Fig. 3, where the input is an image and a kernel traverses over the input. The area the kernel is located on is called the receptive field, and that data is fed into the next layer.

Figure 2: A fully connected Artificial Neural Network, from [14], with input layer, one hidden layer and output layer.

Figure 3: Convolutional layer with a receptive field, from [15]. The kernel extracts features from an image and feeds that data into the next layer.
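As a rough sketch of a kernel traversing an input, the function below slides a square kernel over a single-channel image stored as nested lists. The image and kernel values are made up for illustration, no padding is applied, and the operation is the unflipped sliding product commonly used in deep learning frameworks; these are assumptions, not details from the thesis.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D image without padding.

    Each output value is computed from one receptive field: the patch
    of the image currently covered by the kernel.
    """
    k = len(kernel)
    output = []
    for top in range(0, len(image) - k + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k + 1, stride):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4 x 4 image and 2 x 2 kernel.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]

dense_map = conv2d(image, kernel, stride=1)   # 3 x 3 output
sparse_map = conv2d(image, kernel, stride=2)  # 2 x 2 output
```

With a stride of 1 the kernel visits nine receptive fields, and with a stride of 2 only four, which shows how a larger stride reduces the amount of data passed on to the next layer.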


There are several variations of neural networks that facilitate different use cases of training on and learning from datasets. The most common for image recognition is the Convolutional Neural Network, CNN for short.

3.2 Convolutional Neural Network

In a Convolutional Neural Network [16] the hidden layers consist of fully connected layers, convolutional layers, pooling layers and activation functions.

• Fully connected layers: In a fully connected layer each neuron has input from all neurons in the previous layer and feeds the result forward to all neurons in the next layer.

• Convolutional layers: These use the receptive fields described previously. Using convolutional layers in the beginning of the network and throughout the network helps to reduce the amount of data processed. The convolutional layer divides the input into smaller chunks that can then be processed instead of handling the complete input data. An extracted feature can be discarded if it does not fulfill the wanted criteria. A stride can be used to make the kernel step more than one pixel at a time. A stride larger than 2 is rare.

• Pooling layers: Pooling is an important concept of an ANN and is a form of non-linear down-sampling. By taking an input, dividing it into a grid and applying a function to each grid cell, a smaller output is generated. The step size of the division is called the stride. There are several common functions that are used for ANNs. The most common one is max pooling, where the largest value from each cell is chosen. Typically, a max pooling layer uses a 2 × 2 window.

Another pooling operation is to take the average of the values in a cell as output. A third pooling operation is the Euclidean norm, which is calculated for each cell in the grid by squaring each value and taking the square root of the sum of the squares.

$$E(x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \tag{1}$$

A pooling layer like the one in Fig. 4 with a stride of 2 will effectively reduce the number of activations by 75%, reducing the training and detection time considerably.

Figure 4: An example of a max pooling layer, from [17]. For a 4 × 4 image with a single depth, a stride equal to 2 results in a 2 × 2 shape.

• Activation functions: By taking the input and applying a function, an output is generated. Often these functions generate an output within a range, for example from 0 to 1. The two most common activation functions are:

Rectified linear unit: Abbreviated ReLU, this function takes the maximum of the input and zero, effectively removing negative values.

$$f(x) = \max(0, x) \tag{2}$$

Sigmoid function: Another common activation function is the sigmoid function:

$$\sigma(x) = \frac{e^{x}}{e^{x} + 1} = \frac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1} \tag{3}$$

ReLU is the more common of the two because the CNN is able to train faster without sacrificing the overall generalization of the model.
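The pooling operations and activation functions listed above can be sketched in a few lines. The sketch assumes non-overlapping pooling windows, so that the stride equals the window size, and uses a made-up 4 × 4 input; it is meant as an illustration of Equations (1) to (3) rather than the implementation used in the thesis.

```python
import math


def pool2d(image, size, reducer):
    """Split a 2D input into size x size cells and reduce each cell.

    Windows do not overlap, so the stride equals the window size.
    """
    output = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            cell = [image[top + i][left + j]
                    for i in range(size) for j in range(size)]
            row.append(reducer(cell))
        output.append(row)
    return output


def average(cell):
    """Average pooling: mean of the values in a cell."""
    return sum(cell) / len(cell)


def euclidean_norm(cell):
    """Euclidean norm pooling, Equation (1)."""
    return math.sqrt(sum(value * value for value in cell))


def relu(x):
    """Rectified linear unit, Equation (2)."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid function, Equation (3)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical 4 x 4 single-depth input.
image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]

max_pooled = pool2d(image, 2, max)      # [[6, 4], [7, 9]]
avg_pooled = pool2d(image, 2, average)
norm_pooled = pool2d(image, 2, euclidean_norm)
```

The 16 input values are reduced to 4 outputs, which corresponds to the 75% reduction in activations mentioned for a stride of 2.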

These layers together form a so-called backbone that, with different structures, can perform tasks such as image recognition. There exist several backbones that will be discussed in this thesis.

The inputs to these neurons are multiplied by so-called weights. These weights are adjusted for each iteration of training in order to improve the accuracy of the model. The weights can be stored in a weight file after training and validation have been done on the neural network model. Next time, the CNN can start from the state it was in when it last finished training.

A CNN trains by sending data into its model, in this case images, observing the result versus a ground-truth and then adjusting the weights of the model to better fit the ground-truth. The ground-truth consists of the masks that correspond to different clothes in an image and are manually annotated.

Training is done by splitting a dataset into two parts, one that is fed into the model for training and one for validating the accuracy of the model. A training session is divided into a number of epochs, or training cycles. During each epoch a number of images are chosen from the training fold of the dataset and a number of images are chosen from the validation fold of the dataset. The CNN model uses the training set to make adjustments to the weights to see how they affect the model. These adjustments are then validated on the validation set, and thus the model knows whether the adjustments improved the model or not.

It is important that the validation and training sets do not overlap. If the sets do overlap, the model will get very good at identifying those specific images that are included in both sets. How the validation of the model is done is described in depth in Section 3.5.
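A non-overlapping split can be sketched as follows. The function name, the 20% validation fraction and the fixed seed are assumptions chosen for illustration; the thesis does not state the ratio used here. The count of 269 images matches the annotated Spotin subset described earlier.

```python
import random


def split_dataset(image_ids, validation_fraction=0.2, seed=0):
    """Shuffle the ids once and cut the list into two disjoint folds."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_validation = int(len(ids) * validation_fraction)
    validation_ids = ids[:n_validation]
    training_ids = ids[n_validation:]
    return training_ids, validation_ids


# Hypothetical ids for the 269 annotated images.
train_ids, val_ids = split_dataset(range(269))

# No image may appear in both folds.
overlap = set(train_ids) & set(val_ids)
```

Because every id ends up in exactly one of the two slices, `overlap` is empty by construction.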

3.3 Overfitting and Underfitting

When training a model, it gradually improves at finding features with each epoch that it trains. But when too much training is performed, the model can start to lose its ability to generalize when doing detections. Instead, it will learn to do detection only in that specific training set of the dataset [18]. This is even more of a problem when dealing with a smaller dataset, as is the case for this thesis. During training the same images will be reused several times, compared to a larger dataset where each image will be used less frequently.

During training the model will get better at detecting within the given dataset. When overfitting occurs, the model performs worse at doing detections on images not belonging to the dataset. Overfitting might not be a problem if the use case is to only do detections on the dataset itself. But in the case of Spotin it is likely that the whole public catalogue will not be annotated. Therefore, many products and their images will not be part of the dataset used to train the model. The model must be able to find similar features in images not present in the dataset.

A way to deal with overfitting is to apply dropout throughout the model. A dropout layer discards a percentage of activations from the previous layer. As a result, the model gets better at generalizing, since some data is lost during training, and the model therefore becomes more robust for a small dataset.
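A dropout layer can be sketched as below. The rescaling of the surviving activations by `1 / (1 - rate)` is the common "inverted dropout" convention and is an assumption here, as are the function name and the example values; the thesis does not specify which variant its model uses.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training.

    `rate` must be below 1. Survivors are scaled up so that the expected
    sum of the layer stays the same (inverted dropout, an assumption).
    At detection time the layer passes its input through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)
activations = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, rng=rng, training=False)
```

Each value in `dropped` is either zero or twice the original activation, while `unchanged` equals the input.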


The opposite of overfitting is underfitting, which occurs during the early stages of training a model. The model cannot yet extract feature patterns from the dataset. With more training the model starts to find these patterns and gets better at finding detections. This can be seen in the results in Section 5.1, where the model always starts with low accuracy and with time gets better at finding features.

Ideally, the model is trained so that neither of the above cases occurs.

3.4 Limited Dataset

In the case of Spotin's public catalogue, only a small subset of the catalogue is annotated. Having a dataset that is limited in size is challenging because the model easily gets overfitted, but there are several methods to overcome this issue. One has been mentioned before: find a similar dataset with a larger volume of data, train on it to get good results and then use the result from that training to fine-tune the model on the smaller dataset. This is called transfer learning [10]. Another possibility is to use data augmentation [19], DA, to manipulate images in the dataset and thereby multiply the number of images.

3.4.1 Transfer Learning

By training a model on the dataset it will slowly learn and increase in accuracy over time. This process can be very time-consuming because the model has to learn from scratch, and it can take days or weeks to arrive at a model with acceptable accuracy. Often a model trains for a given number of so-called epochs, and during each epoch the model does a given number of training steps followed by a given number of validation steps in order to rectify its missteps. A weight file is generated at the end of each epoch containing the result of that iteration of the model's training.

The weight file is then used to continue training where the model last left off, or it can be used as the base for training a model on another dataset. No matter what objects a dataset contains, the features of image recognition problems are often quite similar even if the objects of the different datasets are completely different. Therefore, this technique is often used when training on a new dataset or when dealing with a small dataset. The CNN can start training from another dataset's weight file or continue training from its own weight files where the model last quit, thus not having to start over.
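The idea of starting from another dataset's weight file can be sketched with a toy model whose weights are stored per layer in a dictionary. The layer names, the flat lists of numbers and the `exclude` parameter are all hypothetical; a real weight file and model are considerably more involved. The sketch only shows that layers with a matching name and size are restored, while layers tied to the number of classes are left to be trained anew.

```python
def load_pretrained(model_weights, pretrained_weights, exclude=()):
    """Copy pretrained values into layers with a matching name and size.

    Layers listed in `exclude`, missing from the model, or with a
    different size keep their current (freshly initialized) values.
    Returns the names of the layers that were restored.
    """
    loaded = []
    for name, values in pretrained_weights.items():
        if name in exclude or name not in model_weights:
            continue
        if len(values) != len(model_weights[name]):
            continue
        model_weights[name] = list(values)
        loaded.append(name)
    return loaded


# Hypothetical weight file from a model trained on 80 object categories.
pretrained = {
    "backbone_conv1": [0.2, -0.1, 0.4],
    "backbone_conv2": [0.3, 0.3, -0.6, 0.1],
    "class_head": [0.05] * 80,
}

# Hypothetical new model for eight clothing categories.
model = {
    "backbone_conv1": [0.0] * 3,
    "backbone_conv2": [0.0] * 4,
    "class_head": [0.0] * 8,
}

restored = load_pretrained(model, pretrained, exclude=("class_head",))
```

After the call the feature-extracting layers carry the pretrained values, and only the class-specific layer has to be learned from the smaller dataset.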

Figure 5: Visualization of the flow of training between datasets.

Fig. 5 visualizes how training will be conducted. The COCO weight file will always be used for TL when training. The effect that TL has on accuracy when training on Spotin's dataset will be tested in the following manner:

1. Training will be done on the Modanet dataset [12] with the COCO weight file for TL.

2. The best weight file from step 1 will be chosen for training on the Spotin dataset.

3. Finally, the Spotin dataset will be trained with the COCO dataset weight file for TL.

Then the results from steps 2 and 3 will be compared during evaluation and discussion. The weight files that are downloaded, COCO, and generated by training, Modanet and Spotin, can be seen in Fig. 6.

Figure 6: Visualization of the weight files from training on the different datasets. The COCO weight file is downloaded online, but the Modanet and Spotin weight files are trained by the detection model.

3.4.2 Data Augmentation

With a small dataset, a challenge is to avoid overfitting the model when training for a longer duration. A small dataset can severely hamper the model’s ability to make generalized decisions [19]. By using DA it is possible to extend the dataset multiple times. DA means taking an image and applying a minor alteration to it in order to create a new image that the model treats as completely different. Humans might barely see the difference, but from the model’s point of view the dataset has doubled simply by rotating each image one percent to the left. By chaining these augmentations in sequence and in parallel, the dataset can be extended multiple times.

3.5 Mean Average Precision

In order to validate the model and get an indication of how it performs, a tool is needed that can compute a score given the ground-truth and the result from the model. The ground-truth is the mask from the manually annotated dataset. In order to achieve a perfect score of 1, the detected mask of a product must match the ground-truth mask pixel for pixel.

This can be done with mean average precision [20], mAP. The average precision per class is given by calculating the area under the precision/recall curve. In order to find these curves, the ground-truth is compared with the detection generated by the model. Detections with an intersection over union, IoU, above a given threshold are kept, while detections with an IoU below it are discarded. This threshold will be 50%, 0.5 as a value, throughout this thesis if not stated otherwise.

\[
\mathrm{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}} \tag{4}
\]
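As an illustration of Eq. 4, the following minimal sketch computes the IoU of two binary masks stored as nested lists. The function name `mask_iou` and the two small example masks are assumptions made for this illustration and are not taken from the prototype.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of equal shape."""
    overlap = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            overlap += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks have no union; treat that case as IoU 0.
    return overlap / union if union else 0.0


# Hypothetical 3 x 4 masks: 1 marks a pixel that belongs to the object.
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
detection = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

iou = mask_iou(ground_truth, detection)
keep = iou > 0.5  # threshold used throughout the thesis
```

Here five pixels overlap and seven pixels are covered by at least one of the masks, so the detection would be kept at the 0.5 threshold.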

For each ground-truth instance, the matching detection with the highest score counts as a true positive, TP, and the remaining detections mapped to that ground-truth are marked as false positives, FP.

By iterating over the detections mapped to a ground-truth, the recall and precision values can be calculated using the IoU result from that iteration as well as the previous iterations.

Recall is computed as the ratio of TP detections to the ground-truth instances, where FN denotes false negatives, i.e. ground-truth instances that were not detected.

\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}
\]

Precision is calculated as the ratio of TP detections to all detections on that instance.

\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}
\]

Recall values never decrease and rise for each detection with an IoU above the 0.5 threshold, while precision values follow a zigzag pattern.

Table 2: Visualization of calculation of recall and precision.

Iteration   IoU > 0.5   Precision   Recall
1           True        1.0         0.33
2           True        1.0         0.67
3           False       0.67        0.67
4           False       0.5         0.67
5           True        0.6         1.0

The precision and recall pairs from Table 2 can be visualized in a graph as seen in Fig. 7. The dips in the zigzag pattern are filled with the maximum value to the right of the dip, as shown by the red series.

[Plot: Precision on recall. X-axis: Recall (0 to 1). Y-axis: Precision (0 to 1). Series: Precision on recall; Filled precision on recall.]

Figure 7: Accuracy of Modanet model for each epoch of training.

The area under the filled curve is then the average precision, AP, which has a value between 0 and 1. These scores represent how good the detection is compared to the ground-truth of that class instance. The mean average precision is calculated as the sum of all AP values divided by the number of instances.

\[
\mathrm{mAP} = \frac{1}{n}\left(\sum_{i=1}^{n} \mathrm{AP}_i\right), \quad n = \text{number of instances} \tag{7}
\]
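The calculation described in this section can be sketched as follows. The example uses five ranked detections with the hit pattern of the IoU column in Table 2 and assumes three ground-truth instances; the function names are chosen for this illustration and are not taken from an evaluation library.

```python
def precision_recall(is_true_positive, num_ground_truth):
    """Running precision and recall over detections sorted by score."""
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)            # TP / (TP + FP), Eq. 6
        recalls.append(tp / num_ground_truth)   # TP / (TP + FN), Eq. 5
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the filled precision/recall curve."""
    # Fill each dip with the maximum precision to its right.
    filled = list(precisions)
    for i in range(len(filled) - 2, -1, -1):
        filled[i] = max(filled[i], filled[i + 1])
    area, previous_recall = 0.0, 0.0
    for p, r in zip(filled, recalls):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


def mean_average_precision(ap_values):
    """Eq. 7: the sum of all AP values divided by their count."""
    return sum(ap_values) / len(ap_values)


# Assumed example: five detections, three ground-truth instances.
hits = [True, True, False, False, True]
precisions, recalls = precision_recall(hits, num_ground_truth=3)
ap = average_precision(precisions, recalls)
```

With these assumptions the precision dips after the two misses and recovers on the last hit, and filling the dips before integrating gives an AP of roughly 0.87.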

3.6 Object Detection Methods

The use case researched in this master’s thesis is a classic object detection [21] problem. The possible solutions to the problem are (1) traditional machine learning and (2) deep learning, a subset of machine learning. They use different approaches to solve the same problem of finding objects.

3.6.1 Machine Learning

Traditional machine learning requires predefined features and uses models that are smaller when training. In order to identify objects within images with this approach, features have to be manually extracted, which makes the approach less desirable for more complex use cases. For simpler use cases with fewer features and limited complexity this approach can still be viable. Examples of such use cases are locating eyes on a face or pedestrians on a sidewalk within an image.

3.6.2 Deep Learning

In recent years, with more computational power and cheaper hardware, deep learning has been used increasingly. Previously it was not feasible to run because of its deeper, more complex architecture. The difference from traditional object detection is that a deep learning model is a complete end-to-end model that also extracts features while training, which makes these models more robust and better suited for use cases with different objects. Increased research within the field has led to increased accuracy and decreased computation time.

There are several deep learning methods for detecting objects within an image, and they can be categorized into two types: Region Based Detectors and Single Shot Detectors.

3.6.2.1 Region Based Detectors

Object detectors previously used sliding-window detection to extract regions for a limited set of categories. This worked for categories with a shared aspect ratio, such as faces [22, 23], hands [24] or pedestrians [25].

CNN-based object detectors had until then been slow due to the number of features that sliding-window detection methods extracted for region proposals. A different approach, called Region based Convolutional Neural Network, R-CNN for short, was released in 2014.

Instead of the sliding-window detection method, Selective Search [4] is used to generate 2000 region proposals within an image. Proposals are generated by a bottom-up operation that keeps the regions with wanted features, such as textures and color intensity, based on local cues. These 2000 regions are warped into a predefined image size and are then used to train the model.

Since that initial release two additional iterations have been made, each increasing accuracy and speed. Fast R-CNN [26] managed to increase accuracy from 62% mAP to 66% mAP on Pascal VOC 2012 [27] and is nine times faster than R-CNN during training and 213 times faster for detection.

Faster R-CNN [28], FR-CNN, reduced the time for generating proposals to 10 ms by implementing Region Proposal Networks that share layers with the convolutional neural network to generate 300 proposal regions. This reduced the detection time further, to 200 ms per image.

All these implementations compute bounding boxes that surround the proposed detection. Bounding boxes are rectangles that contain the object in question. Mask R-CNN [29], MR-CNN, is different in that aspect. Instead of generating boxes that display the detection, it generates masks that follow the outline of the detected object. This adds only a small overhead and keeps the near real-time performance of FR-CNN.

3.6.2.2 Single shot detectors

Instead of having two separate sets of layers, one to first find regions of interest and then the rest of the CNN to do the actual object detection, single shot detectors combine both problems into one CNN.

Single Shot Multibox Detector [30], SSD for short, runs a CNN on an input image once and generates a feature map. On this feature map a 3 × 3 convolutional kernel is run to generate bounding boxes and class probabilities for those bounding boxes. SSD also uses anchor boxes of different sizes, just like the R-CNN methods, in order to learn the offset of bounding boxes rather than learning the bounding box itself.

Another single shot detector is You Only Look Once [31, 7, 32], YOLO, which works by dividing the input image into a grid of S × S smaller images. Each of these smaller images is used to predict bounding boxes. In total, N bounding boxes are produced per smaller image. Each bounding box contains a classification score for all classes and a score representing the likelihood that the bounding box actually contains an object, independent of class. Combining these two scores gives a total score for the likelihood of a specific class in a box. A drawback with YOLO is that grid cells only predict one class in the end, and it therefore has a hard time detecting very small objects.

3.6.2.3 Comparing the numbers

Table 3: Comparing object detectors on Pascal VOC 2007 [33] + 2012 [27] datasets.

Object detector       mAP    FPS (frames per second)
FR-CNN                76.4   5
SSD 500               76.8   19
YOLO v2 544 × 544     78.6   40

The different object detectors perform quite similarly in accuracy, as seen in Table 3, but differ quite a bit in speed: FR-CNN is slower than the other two. As real-time performance in the sense of a high frame rate is not needed for the use case, FR-CNN will suffice. Since the detectors perform similarly in accuracy, accuracy does not determine which one is chosen. As the data from the detection will be displayed to the end-user, the mask capability of MR-CNN, the extended version of FR-CNN, is interesting. Having the detection follow the outline of the detected clothing type was a good-to-have feature, and therefore MR-CNN was chosen as the object detector.

3.7 K-Nearest Neighbors

In order to find related products, the k-nearest neighbors [34], K-NN, algorithm will be used. As the name indicates, the algorithm is used to find the k closest neighbors to a query point. The neighbors reside in an n-dimensional space. All input data must have the same dimensions.

In Fig. 8 there are two types of input, blue squares and red triangles. All inputs consist of three parameters. The first describes the type of the input, the label, in this case blue square or red triangle. The other two parameters are the X and Y coordinates that place the input in the space. The coordinates used as input will in this case define the dimensions of the space.

If k is set to three, the solid line in Fig. 8, two red triangles and one blue square will be nearest to the green query point. If the type is decided by which label has the most neighbors, the red triangles are in the majority and are therefore chosen as the resulting value. If k on the other hand is five, as shown with the dotted line in Fig. 8, the query will instead find two red triangles and three blue squares. In this case the blue squares are in the majority and are therefore chosen as the resulting value.

Figure 8: Visualization of k-nearest neighbors, from [35], in a 2D space with different k-values.

The weights that will be analysed for deciding which neighbors should be chosen are:

uniform Uniform weights. All points in the n-dimensional space are weighted equally.

distance Weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
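The difference between the two weightings can be illustrated with a small pure-Python sketch. It is not the implementation used in the prototype; the function `knn_predict` and the coordinates are made up for this example and only loosely mimic the situation in Fig. 8.

```python
import math
from collections import defaultdict


def knn_predict(points, query, k, weights="uniform"):
    """Return the label with the largest vote among the k nearest points.

    points: list of (label, coordinates) pairs, all with the same dimensions.
    weights: "uniform" or "distance" (inverse distance).
    """
    by_distance = sorted(
        (math.dist(coords, query), label) for label, coords in points
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # A point that coincides with the query gets a very large weight.
            votes[label] += 1.0 / distance if distance > 0 else float("inf")
    return max(votes, key=votes.get)


# Hypothetical 2D inputs: (label, (x, y)).
points = [
    ("red", (1.0, 0.0)),
    ("red", (0.0, 1.2)),
    ("blue", (1.5, 0.0)),
    ("blue", (0.0, -2.0)),
    ("blue", (-2.2, 0.0)),
    ("red", (5.0, 5.0)),
]
query = (0.0, 0.0)

label_k3 = knn_predict(points, query, k=3)
label_k5 = knn_predict(points, query, k=5)
label_k5_distance = knn_predict(points, query, k=5, weights="distance")
```

With these made-up points, uniform weights give red for k = 3 and blue for k = 5, while distance weights give red for k = 5 because the two red points lie closest to the query.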

4 Implementation

A prototype was implemented to test the functionality researched. This prototype is able to trigger and execute detections to find clothing types in an image, and predictions to get recommended products based on a mask generated from a detection. It is also possible to start training the detection model through this interface.

The object detector, MR-CNN [29], has been wrapped in nodes that sit behind a back end. The nodes and the back end communicate through a Redis database. Redis is an in-memory data structure store that works as a database, cache or message broker. In-memory means that the data is not persistent if the power is lost and the computer shuts down. In this case that is acceptable, because if the connection is lost no computation can be done anyway. Because the data is stored in memory, it is faster to access than data on traditional hard disk storage. This is good, as unnecessary overhead would result in slower requests for the user.

The back end can be accessed through a RESTful Web Service, RWS [36], an interface accessible through the HTTP protocol. A REST API, Application Programming Interface, was chosen because it is versatile and easy to work with. It can be expanded upon, security can be added in the future depending on the route, and it makes for low coupling between back end and front end. In Fig. 9 three nodes are connected to the back end.

The back end is accompanied by a front end that is used to access the API. This front end is a prototype and should be replaced by a more robust interface or integrated into an already existing platform. Included in the front end is an annotation tool that makes it easy to label the dataset with clothing types.

Figure 9: Component diagram of the back end describing the flow of requests made by a user via the REST API and then redirected to a node in the network.

4.1 Back End

The back end consists of several components that have different tasks. The first component reached upon a request by a user is the REST API. The next component is what binds the back end and the nodes together, a Redis database, which is used to store tasks requested by users and to store the result when a task has been completed and is ready to be returned by the API. Python is used for many machine learning libraries and is therefore used in the nodes. In order to be consistent, Python was used in the back end as well. Flask was chosen as the framework that handles the routes of the API because of previous experience with it and because it is lightweight, not requiring a lot of system resources.

The REST API has many routes for tasks such as detections within an image, prediction of products and starting training. There are also routes for fetching the status of the back end and the nodes, such as the number of nodes connected and whether these nodes are processing tasks or are idle, as well as routes to fetch data about categories and products. Appendix A lists the routes that the back end currently has, with how each route is structured, what type of request it is, its possible parameters and the expected return values.

When a user makes a request for either starting training or doing a detection, the back end handles it by queuing the request in the Redis database and waiting for a node to process the request.

For training a model, the back end must have at least one node connected with the required hardware for training. This requirement is set in the docker-compose.yml file in the root of the project. It does not matter if the node is already doing some work when the request is queued in the training list. For detection there is no hardware check for the request; the request is simply queued in the detect list of the Redis database. Each request has a unique id generated before it is enqueued for processing, so that each request can be tracked.

The back end has a component that handles the data stored in the Redis database at a periodic interval. The component checks if registered nodes have disconnected for some reason. If that is the case, the node is removed from the list of active nodes until it re-registers itself. This component also handles the requests stored in the lists described above for training and detection tasks. When a request is stored for processing, the component iterates over the nodes connected to the network. If a node is available and fulfils the requirements for the task, the request is assigned to that node by storing the request under that node’s id. By checking the nodes’ ids in the database, the back end knows which nodes are occupied processing requests and which are idle. As soon as a training request is put into the list for processing, an OK response is returned to the caller.

Detections are also assigned to a node by assigning the task to the node’s id. The back end waits up to 30 seconds before timing out if the node has not finished processing the task. If the detection is done within that time frame, the return data is stored in the Redis database under the unique id of that request. Hence the back end knows when a detection has been processed and can be returned to the caller.
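The queuing and assignment flow described above can be sketched with an in-memory stand-in for the Redis database. This is a simplified illustration and not the prototype’s code: the class `InMemoryStore`, the key prefix `node:`, the node fields `score` and `can_train`, and the example URL are all assumptions, and the 30 second timeout is left out.

```python
import uuid


class InMemoryStore:
    """Stand-in for the Redis database: named lists plus a key/value map."""

    def __init__(self):
        self.lists = {"train": [], "detect": []}
        self.values = {}


def enqueue(store, kind, payload):
    """Queue a request under a unique id so that it can be tracked."""
    request_id = str(uuid.uuid4())
    store.lists[kind].append({"id": request_id, "kind": kind, "payload": payload})
    return request_id


def assign_pending(store, nodes):
    """One pass of the periodic component: hand queued requests to idle nodes."""
    for kind in ("train", "detect"):
        still_waiting = []
        for request in store.lists[kind]:
            candidates = [
                n for n in nodes
                if store.values.get("node:" + n["id"]) is None
                and (kind == "detect" or n["can_train"])
            ]
            if not candidates:
                still_waiting.append(request)
                continue
            # The available node with the highest compute score is chosen.
            best = max(candidates, key=lambda n: n["score"])
            # Storing the request under the node's id marks the node as occupied.
            store.values["node:" + best["id"]] = request
        store.lists[kind] = still_waiting


store = InMemoryStore()
nodes = [
    {"id": "cpu", "score": 1.0, "can_train": False},
    {"id": "gpu", "score": 10.0, "can_train": True},
]
first_id = enqueue(store, "detect", {"url": "https://example.com/image.jpg"})
assign_pending(store, nodes)
```

In this sketch the first detection goes to the node with the highest score, and a request stays in its list until a suitable node is idle again.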

4.1.1 Prediction

The lookup for finding suggested products based on the mask of a clothing type is done on the back end with K-NN. In order to do these predictions, the model has to be built first, and building it is the time-consuming part. Input data for the model is based on the pixel data describing the mask of a clothing type. There are two sources of masks: first, the raw annotation data that is manually labeled, and second, the masks generated from doing detections.

A problem is that the model built from the manually annotated images would be quite limited in size. This is because Spotin’s dataset increases in size over time as more products are added, but not every product will be annotated: when the model is good enough to detect all clothing types, it is sufficient and no more training is needed. New training is only needed once accuracy is no longer good enough because the dataset has grown.

The masks based on manual annotation are limited to the number of annotated images, while the second type of mask, generated from detections, is more dynamic. The latter is preferred as long as the detection model has a high detection rate.

From the masks of the selected source, two different kinds of input data can be extracted: either the raw pixel data of the mask or the histogram of the mask. These will be evaluated later.

For each input, the product’s id is used as the label for that data. The label is the value that will be returned as the result of a prediction.
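The two kinds of input data can be sketched as follows. The sketch assumes a grayscale image stored as nested lists and interprets the histogram of the mask as a histogram of the pixel values inside the mask; both are assumptions made for this illustration, as are the function names and the product id.

```python
def raw_pixels(image, mask):
    """Flatten the image, zeroing every pixel outside the mask."""
    return [
        pixel if inside else 0
        for image_row, mask_row in zip(image, mask)
        for pixel, inside in zip(image_row, mask_row)
    ]


def mask_histogram(image, mask, bins=4, max_value=256):
    """Normalized histogram of the pixel values inside the mask."""
    counts = [0] * bins
    for image_row, mask_row in zip(image, mask):
        for pixel, inside in zip(image_row, mask_row):
            if inside:
                counts[pixel * bins // max_value] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Hypothetical 2 x 3 grayscale image and the mask of one clothing type.
image = [
    [10, 200, 130],
    [70, 255, 0],
]
mask = [
    [1, 1, 0],
    [1, 1, 0],
]

pixel_features = raw_pixels(image, mask)
histogram_features = mask_histogram(image, mask)

# Each input is paired with the product's id as its label.
training_pair = (histogram_features, "product-123")  # hypothetical id
```

The histogram has a fixed length regardless of image size, which satisfies the K-NN requirement that all inputs have the same dimensions, whereas raw pixel data would require images of equal size.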

4.2 Node

Behind the Redis database one or several nodes can be connected. A node connects by sending a signal telling the database it is ready for work. A node can use one or several graphics cards or a CPU to process tasks. When a node is started its performance is benchmarked, and the resulting compute score is sent to the database when the node registers that it is ready to receive work. The nodes can do two types of work. The first, which all nodes can do, is to detect clothing types within an image. The second is to train the model; for a node to be able to do this work it has to have enough memory to store images and weight files.

The score from the evaluation of a node’s computation power is used when delegating work between nodes. When a training task is requested, the memory of a node is the limiting factor that decides whether the node will accept training tasks or not. The available node with the highest compute score is always chosen, for all types of work. As seen in Table 10, a CPU performs much worse than a graphics card in this implementation, and therefore a dedicated graphics card will be used for training.

When a node is idle it listens for pending work stored in the Redis database under that node’s id. When a node has been assigned work and is done processing it, the result is stored in the database under the id of that request. The complete flow of a user’s request is described in Section 4.1.

As described in Section 4.1, the programming language chosen for the node is Python, because most machine learning libraries are implemented in Python.

4.2.1 Training

In order for the node to be able to execute detections, it first has to train a model that can then be used for detections and for building the prediction model. The node takes a few parameters when training: the number of training and validation steps that should be done each epoch, as well as the number of epochs.

Parameters for the weight file used as the base for training and for the target weight file are optional. The base weight file, trained on the Modanet dataset, is used by default. If no destination file is given, the model will save the resulting weight file as spotin.

MR-CNN, the object detector used for training, has many parameters that can be changed. Parameters that were changed:

IMAGES_PER_GPU This was set to 1 because the graphics card used for training only has 8 GB of memory. The reason that only one image could be used is that for each image the whole model has to be stored in memory.

STEPS_PER_EPOCH This is the number of steps the model trains for during one epoch and was changed depending on the size of the dataset. For a smaller dataset this was set lower so that the same image was not reused several times during an epoch.

VALIDATION_STEPS Setting this too large, or as large as STEPS_PER_EPOCH, will make training slow, as this number of validation steps is done after the training steps in each epoch. For a small dataset this was set to 0.25 of the training steps, and for a larger dataset a ratio of 0.05 compared to the training steps was chosen.

BACKBONE By default this is set to resnet101, which would most likely yield higher accuracy. Due to memory issues when training on the larger Modanet dataset, resnet50 was chosen for the whole thesis for consistency, and resnet101 was never tested for the smaller dataset.

NUM_CLASSES This should be set to the number of classes the dataset contains plus one additional class for the background.

The rest of the configuration file was left as the original authors set it. Other parameters that were changed were which layers of the backbone were trained and for how many epochs the model should train.
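The changed parameters can be summarized in a small sketch. It collects the values in a plain dictionary rather than in the configuration class of the MR-CNN implementation; the helper `training_config`, its arguments and the class counts in the example are assumptions made for this illustration.

```python
def training_config(num_dataset_classes, steps_per_epoch, small_dataset):
    """Collect the changed MR-CNN parameters in a plain dictionary."""
    # 0.25 of the training steps for a small dataset, 0.05 for a larger one.
    validation_ratio = 0.25 if small_dataset else 0.05
    return {
        "IMAGES_PER_GPU": 1,       # limited by the 8 GB graphics card
        "BACKBONE": "resnet50",    # chosen instead of resnet101 for memory reasons
        "STEPS_PER_EPOCH": steps_per_epoch,
        "VALIDATION_STEPS": round(steps_per_epoch * validation_ratio),
        "NUM_CLASSES": num_dataset_classes + 1,  # plus one for the background
    }


# Hypothetical class counts; only the ratios and step counts follow the text.
large = training_config(num_dataset_classes=13, steps_per_epoch=1000, small_dataset=False)
small = training_config(num_dataset_classes=2, steps_per_epoch=100, small_dataset=True)
```

With 1000 training steps the 0.05 ratio gives 50 validation steps, while a smaller run of 100 training steps with the 0.25 ratio gives 25.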

Because the dataset is small, augmentations of the images in the dataset are used to extend its volume without any additional annotation work.

4.2.1.1 Data Augmentation

By manipulating the images in the dataset it is possible to multiply the amount of data fed to the model when training. This is useful if the dataset is small, as is the case with Spotin’s dataset. Augmentations that are done to an image:

1. Horizontal inverting 50% of the time

2. One of the following

(a) Do nothing

(b) Rotate ±3 degrees

(c) Shearing ±5 degrees

(d) Scale ±10 percent

(e) Scale ±5 percent in X-axis or Y-axis
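The list above can be read as a two-stage random choice, sketched below with the standard library only. The sketch draws the parameters of one augmentation and does not manipulate any image; the function `sample_augmentation`, the parameter names and the use of uniformly distributed values within each range are assumptions made for this illustration.

```python
import random


def sample_augmentation(rng):
    """Draw the augmentation parameters for one image."""
    # Step 1: horizontal flip 50% of the time.
    params = {"flip_horizontal": rng.random() < 0.5}
    # Step 2: exactly one of the five options, each equally likely.
    kind = rng.choice(["none", "rotate", "shear", "scale", "scale_axis"])
    params["kind"] = kind
    if kind == "rotate":
        params["degrees"] = rng.uniform(-3, 3)
    elif kind == "shear":
        params["degrees"] = rng.uniform(-5, 5)
    elif kind == "scale":
        params["factor"] = 1 + rng.uniform(-0.10, 0.10)
    elif kind == "scale_axis":
        params["axis"] = rng.choice(["x", "y"])
        params["factor"] = 1 + rng.uniform(-0.05, 0.05)
    return params


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
samples = [sample_augmentation(rng) for _ in range(1000)]
```

Because the parameters are drawn anew each time an image is used, the same source image rarely reaches the model in exactly the same form twice.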

The dataset annotated from Spotin’s public catalogue consists of 269 annotated images. Training on only these images with no data augmentation would result in a model that can find exactly these images but does not generalize well when detecting on other images. By flipping the image horizontally 50% of the time, the size of the dataset is doubled, making the dataset effectively 538 images. Applying one more augmentation, chosen from the list of five possible actions, increases the size of the dataset again. Many of these augmentations vary within a range, making the dataset even more dynamic.

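\[
\begin{aligned}
\text{Dataset increase} &= 2 \times \left( 0.2 \times 1 + 0.2 \times \left(\tfrac{6}{7} \times 6\right) + 0.2 \times \left(\tfrac{20}{21} \times 20\right) + 0.2 \times \left(\tfrac{10}{11} \times 10\right) + 0.2 \times 2 \times \left(\tfrac{10}{11} \times 10\right) \right) \\
&= 20.98528138528 \approx 21
\end{aligned}
\tag{8}
\]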

As seen in Eq. 8, applying these augmentations to the images in the dataset expands it almost 21 times, ending up with approximately 5645 images instead of 269.
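The arithmetic of Eq. 8 can be checked with exact fractions. The snippet only reproduces the terms of the equation; the variable names are chosen for this illustration.

```python
from fractions import Fraction

one_of = Fraction(1, 5)  # the factor 0.2 in front of each of the five terms

factor = 2 * (
    one_of * 1
    + one_of * (Fraction(6, 7) * 6)
    + one_of * (Fraction(20, 21) * 20)
    + one_of * (Fraction(10, 11) * 10)
    + one_of * 2 * (Fraction(10, 11) * 10)
)

images = 269 * factor  # 269 annotated images in Spotin's dataset
```

Evaluating `float(factor)` reproduces the value in Eq. 8, and rounding `images` gives the 5645 images mentioned above.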

The reason for not applying more augmentations in combination, or augmentations of larger magnitude, is that the augmented dataset should not deviate too much from the original images. No further investigation was done into whether more augmentations would affect the performance of the model.

4.2.2 Detection

A request for detection has to contain a URL that points to the image that should be processed. An optional parameter specifies whether a model for a specific clothing brand should be loaded and used during the detection. If no brand id is provided, the default model, trained on the complete dataset, is loaded.

The data returned as the result of a detection is the masks and bounding boxes that were found, the class id (clothing type) of each detection, and the category name that the class id corresponds to.

Downloading the image adds overhead to the time it takes to process the request compared to only doing the detection. On top of that, it takes a small amount of time to process and compile the final data that is returned.

4.3 Front End

A prototype of a front end was developed alongside the back end, to test if the idea is worth pursuing. Vue.js [37] was chosen as the framework for the front end. The reason for this choice is previous experience with JavaScript, and since Vue.js is an open source project that has had a lot of talk about it recently, curiosity drove me to take this path.

In addition to displaying the results of detections from the back end and starting training, a simple HTML5 canvas tool was needed to annotate masks in the dataset.

4.3.1 Canvas Library

“Paper.js is an open source vector graphics scripting framework that runs on top of the HTML5 Canvas” [38]. This library was chosen because it has geometric tests for its shapes, such as whether a shape contains a point and whether two shapes intersect. The biggest selling point was the boolean operations of the library: functions to unite, intersect, subtract, exclude, divide and reorient two shapes. As the prototype had to have a tool to annotate Spotin’s dataset, this saved a lot of time.

Otherwise, these operations would have needed to be implemented, as there was a need to combine two overlapping masks that represent the same clothing type. Another nice-to-have feature of the library is that each clothing type can be represented as a layer. This makes it easy to highlight the currently selected layer and to dim layers that are not active, making for a better user experience.

4.3.2 Detection

The interface for testing detections and predictions is quite simple. The current product is shown in full screen with the possibility to rotate through the product’s images with the arrows in the bottom left, Fig. 10. These arrows can also be controlled with the arrow keys on the keyboard.

Detection on the current product is started by clicking the green button, which then turns blue while the request is being processed by the back end. If detections are found they are displayed over the image as seen in Fig. 10 (b).

(a) Before detection. (b) After detection.

Figure 10: Front end interface of detection. Detection is done by clicking the green button that will then turn blue and pulsate, indicating that the task is being processed.

Prediction is done by clicking on the clothing type labels that are displayed after a detection has yielded results. Ideally the area on the image should be interactable, but because the library handling the HTML canvas behaved strangely, a simpler solution was chosen: buttons for the different clothing types are displayed to trigger prediction. The predictions are ranked by likelihood from left to right.

Figure 11: Front end interface of a prediction based on a detection. Prediction is done by clicking the clothing types that the detection has found. The buttons for clothing types can be seen in Fig. 10 (b).

A simple side panel was implemented to showcase interesting products. The panel is accessible by clicking on the cogs icon that can be seen in Fig. 10. Closing the panel is done by clicking outside of it with the mouse or by pressing escape on the keyboard.

Figure 12: Front end interface for changing products.

4.3.3 Annotations

In order to annotate products and images with different clothing types, there had to be a simple interface that displays the different clothing types, as seen in Fig. 13 (a). The name of each clothing type and how many products of that type are in the dataset are displayed when hovering over the thumbnail.

By selecting a clothing type, the products in the dataset are displayed alongside how many images each product has and how many of these images have annotation data. The user can leave this second view by pressing the back arrow or the escape key on the keyboard.

(a) List of all categories that Spotin currently has on its platform.

(b) Selecting product from previously selected category.

Figure 13: Front end interface for navigating to a product that should be annotated.

After selecting a product, the product is displayed in a view similar to the one for detection. The key differences are the down and up arrows for navigating to the next and previous product in the dataset of that clothing type. Another key difference is the list of all clothing types displayed on the far right. This list highlights the currently selected category that will be used to manipulate the masks. It is possible to change the selected clothing type by clicking or scrolling with the mouse. As seen in Fig. 14, in the first image the selected clothing type is Shirts, while in the second it is Pants. Highlighting the mask of the currently selected clothing type makes it very easy for the annotator to see which clothing type is currently being edited.

Editing a mask is simple. By using the left mouse button, a green mask can be drawn, indicating that the mask is being added to the annotation. If this mask overlaps with an existing mask of the same clothing type, they are merged upon mouse release. If they do not overlap, the new mask is treated as a new segment. This is useful for instances with, for example, two shoes. If the annotator uses the right mouse button, a red mask is drawn, indicating that the area will be removed. If the mask divides an existing mask into two, the result is two segments of that clothing type.

Each time the annotation has been changed, the change is uploaded to the back end.

(a) Category Shirts is selected and highlighted.

(b) Category Pants is selected and highlighted.

Figure 14: Front end interface for annotating products. A scrollable list makes it easy to switch the category to annotate a mask with.

4.3.4 Train

Using the training interface is simple. It is possible to tell the node how many training and validation steps each epoch should take and for how many epochs it should train. Brand id is optional and can be used to train on only part of the dataset. The weight files used as the base for training and as the target are also optional. If no TL weight file is given, the default, base, will be used; likewise, if no target weight file is given, the default destination is spotin. As soon as the back end accepts the data, an OK is returned. Currently it is not possible to track the progress of training.

Figure 15: Front end interface to start training a model.


4.4 Dataset

In order to make use of the training done on the Modanet dataset with COCO as TL, the category ids of the Modanet dataset had to be translated to similar category ids in Spotin’s dataset. As seen in Appendix B, an example of this mapping is the Modanet category footwear, which has id 5. A similar category in Spotin’s catalogue is shoes, with id 89. In the file containing the categories of the dataset the ids have been translated to similar ids in Spotin’s dataset, and the category ids of all products in the Modanet dataset have been updated as well.
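The translation can be sketched as a lookup table applied to every annotation. Only the pair 5 to 89 is taken from the text above; the annotation layout with a `category_id` field, the function name and the example records are assumptions made for this illustration.

```python
# Modanet category id -> Spotin category id (footwear -> shoes).
MODANET_TO_SPOTIN = {5: 89}


def translate_category_ids(annotations, mapping):
    """Return annotations with translated ids, plus those without a mapping."""
    translated, unmapped = [], []
    for annotation in annotations:
        category_id = annotation["category_id"]
        if category_id in mapping:
            # Copy the annotation so that the original data is left untouched.
            translated.append({**annotation, "category_id": mapping[category_id]})
        else:
            unmapped.append(annotation)
    return translated, unmapped


# Hypothetical annotation records.
annotations = [
    {"image_id": 1, "category_id": 5},
    {"image_id": 2, "category_id": 999},
]
translated, unmapped = translate_category_ids(annotations, MODANET_TO_SPOTIN)
```

Collecting the annotations without a mapping separately makes it visible which Modanet categories have no similar category in Spotin’s catalogue.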

5 Evaluation

Throughout this section several parameters and theories will be tested to see which contribute to the best result. Testing is done with the prototype implemented in Section 4. Parameters that will be tested are transfer learning, training on one dataset and comparing the result on another dataset, and actual parameters of the MR-CNN model such as epochs, training steps and validation steps.

5.1 Training

Testing was done to see what yielded the highest accuracy of the object detection model for detecting clothing types. If the detection did not yield good results, the prediction model would have fewer matches with which to build the K-NN space.

5.1.1 Modanet Dataset

The model was trained using transfer learning with the COCO [9] dataset as a base, in order to reduce the amount of training needed on the Modanet [12] dataset, with the following parameters:

Table 4: Parameters when training Modanet dataset.

Parameter                  Value
Transfer learning weight   COCO
Epochs                     100
Layers                     All
Training steps             1000
Validation steps           50

Transfer learning weight means that the weight file comes from a model trained on another dataset, in this case the COCO dataset. The model will be trained on all layers of the backbone, which is ResNet-50 [39]. In each epoch a total of 1000 training steps will be done, followed by 50 steps for validating the result. This process ran for 100 epochs, each epoch taking the result of the previous epoch as a base. From this several weight files are generated, one for each epoch.

(Plot: Training Modanet, Layer: All. x-axis: Epoch, 20 to 100; y-axis: Mean average precision [%], 0 to 0.7; series: Modanet dataset, Spotin dataset.)

Figure 16: Accuracy of Modanet model for each epoch of training these parameters from Table 4. Data from Appendix C.

Each weight file generated after an epoch can be scored by measuring mAP [20], as described in Section 3.5, on a set of data from a chosen dataset. For each dataset, the same set of data is always used, so that the scores of different weight files are comparable.

Even though the weight files have been trained on the Modanet dataset, it is possible to load the Spotin dataset and evaluate the performance of a weight file on it. This is because, as stated in Section 4.4, the classes of the Modanet dataset have been translated to those of the Spotin dataset.

Table 5: Parameters when training Modanet dataset.

Parameter                 Values
Transfer learning weight  COCO
Epochs                    100
Layers                    Heads
Training steps            1000
Validation steps          50

(Plot: Training Modanet, Layer: Heads. x-axis: Epoch, 20 to 100; y-axis: Mean average precision [%], 0.1 to 0.5; series: Modanet dataset, Spotin dataset.)

Figure 17: Accuracy of Modanet model for each epoch of training these parameters from Table 5. Data from Appendix D.


5.1.2 Spotin Dataset

Training a model on Spotin's dataset in the same way as on the Modanet dataset, with 1000 training steps followed by 50 validation steps per epoch, over either only the heads or all layers, for 100 epochs, resulted in 0.0 mAP for all epochs. Therefore, other parameters had to be used when training on the Spotin dataset. In addition to the different parameters, a decision was made to train on only two clothing types, namely T-shirts and pants. This simplified the scope of training so that parameters that gave results could be found more easily. The final model based on Spotin's dataset will be trained on all clothing types.

Table 6: Parameters when training on Spotin dataset with different parameters compared to Modanet dataset to find settings that gave result. In order to simplify the training and get result only two clothing types were trained.

Parameter                 Values
Transfer learning weight  COCO
Epochs                    100
Layers                    Heads, 3+, 4+, 5+, All
Training steps            20
Validation steps          5
Remark                    Limited clothing type: T-shirts and pants


(Plot: Training Spotin, TL: COCO. x-axis: Epoch, 20 to 100; y-axis: Mean average precision [%], 0 to 0.5; series: Spotin dataset.)

Figure 18: Accuracy of Spotin model for each epoch of training these parameters from Table 6. Data from Appendix E.

In the previous results from training the model on the Modanet dataset, the 94th weight file performed best with regard to accuracy on both the Modanet and the Spotin dataset. It is therefore used for TL when training on the Spotin dataset.


Table 7: Parameters when training on Spotin dataset with different parameters compared to Modanet dataset to find settings that gave result. In order to simplify the training and get result only two clothing types were trained.

Parameter                 Values
Transfer learning weight  Modanet weight file #94
Epochs                    100
Layers                    Heads, 3+, 4+, 5+, All
Training steps            20
Validation steps          5
Remark                    Limited clothing type: T-shirts and pants

(Plot: Training Spotin, TL: Modanet #94. x-axis: Epoch, 20 to 100; y-axis: Mean average precision [%], 0 to 0.6; series: Spotin dataset.)

Figure 19: Accuracy of Spotin model for each epoch of training these parameters from Table 7. Data from Appendix E.

Fig. 18 and Fig. 19 show that the new parameters, training the different layers in turn, gave better results. The next training is done on the complete dataset and not only on T-shirts and pants. For TL, Modanet weight file number 94 is used.

Table 8: Parameters when training on Spotin dataset with different parameters compared to Modanet dataset to find settings that gave result.

Parameter                 Values
Transfer learning weight  Modanet weight file #94
Epochs                    1000
Layers                    Heads, 3+, 4+, 5+, All
Training steps            20
Validation steps          5

(Plot: x-axis: Epoch, 200 to 1,000; y-axis: Mean average precision [%], 0 to 0.5; series: Spotin dataset.)

Figure 20: Accuracy of Spotin model for each epoch of training these parameters from Table 8. Data from Appendix F.

The same training as in Fig. 20 was then run for more epochs, to test for overfitting and to chain multiple rounds of TL based on the same training. First, the model was trained for 1000 epochs with TL from weight file number 94 of the Modanet training, then for 1000 more epochs with TL from the previous 1000 epochs. Finally, it was trained for 500 more epochs with TL from the second round of 1000 epochs.

Table 9: Parameters when training on Spotin dataset with different parameters compared to Modanet dataset to find settings that gave result.

Parameter                 Values
Transfer learning weight  Modanet weight file #94
Epochs                    1000+1000+500
Layers                    Heads, 3+, 4+, 5+, All
Training steps            20
Validation steps          5

(Plot: x-axis: Epoch, 500 to 2,500; y-axis: Mean average precision [%], 0 to 0.5; series: Spotin dataset.)

Figure 21: Accuracy of Spotin model for each epoch of training these parameters from Table 9. Data from Appendix G.


5.1.3 Timing Performance

The time it takes to train the model varies depending on the parameters. Training the model on the Modanet dataset with 1000 training steps and 50 validation steps for 100 epochs takes around 34 hours, while training the model on Spotin's dataset takes less than 8 hours. For the latter, 20 training steps and 5 validation steps over 1000 epochs were used as parameters.

5.2 Detection

After training, the model can be loaded and used for detections by hooking it up to the back end in nodes.


The time for processing a detection request was evaluated by selecting one image for sampling. This image was then used when testing the different hardware. The detection request was run five times on each hardware configuration, and the mean of the results was taken as a representative value for that hardware.
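The measurement procedure described above can be sketched as follows. This is a minimal sketch, not the code used in the thesis: `send_request` stands in for whatever call issues the detection request, and `fake_request` is a placeholder that only sleeps.

```python
import statistics
import time


def mean_request_time(send_request, runs=5):
    """Call send_request `runs` times and return (mean seconds, all samples)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), durations


def fake_request():
    """Placeholder for a real detection request (assumption for this sketch)."""
    time.sleep(0.01)


mean_seconds, samples = mean_request_time(fake_request)
```

Keeping the individual samples alongside the mean makes it possible to list them in an appendix and to spot a single slow outlier.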

Table 10: Time for executing detection of clothing types within an image. The time is mean value of five samples. The samples are taken from Appendix H.

Node hardware type  Computation unit    Memory  Time (s)
GPU                 Nvidia GTX 1070     8 GB    1.054
GPU                 Nvidia GTX 1050     2 GB    1.278
CPU                 Ryzen 1600 8 cores  8 GB    5.142
CPU                 Ryzen 1600 4 cores  8 GB    6.44
CPU                 Ryzen 1600 4 cores  4 GB    6.64

The detection itself takes on average 485 milliseconds on an Nvidia GTX 1070 with 8 GB of memory. The remainder of the time is used for downloading the image and packaging the result to send back to the user.

The samples were taken while working remotely on a slow library network, and the server hosting the back end and nodes is connected with a 250/100 Mbit connection. This might result in some overhead compared to working locally, but it also reflects a more realistic scenario.


5.3 Prediction

In the Spotin dataset, a total of 269 images have been annotated with masks describing the location of the clothing types. These annotated masks can be used as the input for building up the K-NN model.

As this method for building up the K-NN model is not ideal, another solution is needed, as discussed in Section 4.1.1. Another solution is to use the detections from the object detector model to build the K-NN model. In the long run, this would generate a larger set of input data than using the annotated data. Only the detections that match the product's category id are kept for building the model.

Building up the K-NN model from detection results creates a much larger model. With this solution, 524 images were found, compared with the 269 images yielded by the manually annotated masks.

The input data to the model is then generated from the pixel data of the area describing the clothing type of the product within these images, and the label is the product's id.

As a result, predictions on the same images that were used to generate the model give a pixel-perfect match with a distance of zero. There is another scenario as well, for example when an administrator for a brand has to place spots on an image with multiple clothing types. The image of that product might be shared with other products, for example pants visible in images of T-shirt products and vice versa. The prediction will then be pixel perfect for both products.

Two weight functions were chosen for testing: uniform and distance. The first, uniform, weights all neighbors equally, while the second, distance, weights each neighbor by the inverse of its distance, so that closer points have a bigger impact.
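The difference between the two weight functions can be illustrated with a small, self-contained K-NN vote. This is a sketch under stated assumptions and not the implementation used in the thesis: the function `knn_predict`, the toy feature vectors and the labels are made up, and an exact match (distance zero) is treated as decisive under distance weighting to avoid dividing by zero.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=3, weights="uniform"):
    """Predict a label from the k nearest training points.

    train is a list of (feature_vector, label) pairs.
    weights is "uniform" (one vote per neighbor) or "distance" (1 / distance).
    """
    neighbors = sorted(
        ((math.dist(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
    )[:k]
    # Under distance weighting, exact matches take the whole vote.
    if weights == "distance" and any(d == 0 for d, _ in neighbors):
        neighbors = [(d, label) for d, label in neighbors if d == 0]
        weights = "uniform"
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 if weights == "uniform" else 1.0 / d
    return max(votes, key=votes.get)


# Toy data: product "A" matches the query exactly, product "B" has two
# samples slightly further away.
train = [([0.0, 0.0], "A"), ([1.0, 0.0], "B"), ([1.1, 0.0], "B")]
uniform_result = knn_predict(train, [0.0, 0.0], k=3, weights="uniform")
distance_result = knn_predict(train, [0.0, 0.0], k=3, weights="distance")
```

In this toy case the uniform vote is won by "B", which has two neighbors, while the distance vote is won by the exact match "A". This shows one way the two functions can disagree; it is not a claim about why the results reported here differ.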

Both weight functions were tested on both pixel data and histogram data of the images. Two types of prediction were made: one predicting only a single product and one retrieving an array of the three closest products.
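As an illustration of histogram input and of retrieving the three closest products, consider the following sketch. It assumes grayscale pixel values in the range 0 to 255 and a four-bin histogram; the text does not state how the histograms were computed, so the bin count, the helper names and the pixel values below are all hypothetical.

```python
import math


def gray_histogram(pixels, bins=4, max_value=255):
    """Normalized histogram of grayscale pixel values in 0..max_value."""
    counts = [0] * bins
    for value in pixels:
        index = min(value * bins // (max_value + 1), bins - 1)
        counts[index] += 1
    total = len(pixels)
    return [c / total for c in counts] if total else counts


def closest_products(model, query, n=3):
    """Return the ids of the n entries in model closest to the query vector."""
    ranked = sorted(model, key=lambda item: math.dist(item[0], query))
    return [product_id for _, product_id in ranked[:n]]


# Made-up pixel data for three products.
products = {
    "p1": [0, 10, 20, 30],
    "p2": [200, 210, 220, 230],
    "p3": [0, 100, 150, 250],
}
model = [(gray_histogram(px), pid) for pid, px in products.items()]
top3 = closest_products(model, gray_histogram(products["p1"]), n=3)
```

Because the query is built from the same pixels as product "p1", that product comes back first with a distance of zero, mirroring the pixel-perfect matches described above.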

When predicting one product with the uniform weight function, the pixel model found the correct product for 8 of the 100 products tested, while the histogram model found 9 of the 100. The histogram model was thus slightly better than the pixel model for the uniform weight function.

Both models performed equally when retrieving a list of the three closest products, finding the product in all 100 tests. The data for the uniform function can be found in Appendix I.

The distance weight function performed with much higher accuracy. Retrieving a single product prediction resulted in both the pixel and the histogram model finding the product in all 100 tests, and the same result was obtained when testing for the product in a list of the three closest predictions. The data for the distance function can be found in Appendix J.


On average, it took 0.5248 seconds to do a prediction. Data for the timing performance of prediction can be found in Appendix K; the value is based on five samples.

The samples were taken while working remotely on a slow library network, and the server hosting the back end and nodes is connected with a 250/100 Mbit connection. This might result in some overhead compared to working locally, but it also reflects a more realistic scenario.

5.4 Front End

The front end is mostly a platform for displaying information stored in the back end, triggering tasks and showing the results. Therefore, there is no real need to evaluate the front end other than the annotation tool.

The tool ended up being a good addition to the back end. It is efficient to work with, and annotating images of a product is quite fast. At the beginning of this thesis, it was speculated that running a detection on a product that had not yet been annotated would speed up annotation, because a rough mask would be generated that could then simply be validated against a standard. This later proved to be wrong, as the rough outline still has to be fine-tuned: the detected mask most often does not span all the way out to the edge of the product and is always a bit smaller than the actual product. Hence, the time-consuming part of annotating, the edges, still remained.

6 Discussion

With the help of the theory from Section 3 and the results from Section 5, the questions from Section 1.2 will be answered here.

6.1 Training

6.1.1 Modanet Dataset

Several parameters have been presented in Section 5.1, each affecting the accuracy of the detection model in its own way. The first, and the one that reduces the time needed for training, is transfer learning (TL).

As seen in Fig. 16, training on the Modanet dataset yields good accuracy for each training epoch. In the last 10 epochs, the increase in accuracy seems to stagnate compared to earlier epochs, indicating that the model is starting to reach its maximum.

The same figure also shows the results of evaluation against Spotin's dataset. Interestingly, it is possible to get some results on that dataset even though it was not trained on. This is because the clothing images in the two datasets are similar in many respects, sharing features, but also differ in others, such as lighting and the posture of the model. It is hence not possible to train on one dataset and expect equal results on another dataset.

An average of ≈ 0.62 mAP (see footnote 3) is achieved when evaluating on the Modanet dataset, and ≈ 0.13 mAP is achieved for the Spotin dataset.

Training only the head layers of the model does not yield results as good as training the whole model, as seen in Fig. 17. The average accuracy of the last ten epochs is only ≈ 0.50 mAP for the Modanet dataset and ≈ 0.14 mAP for the Spotin dataset. It is interesting that the evaluation on the Spotin dataset performs slightly better when only the head layers are trained. This might be a coincidence, as each training run is different from the others. It could also mean that, by training only the head layers, the model is not trained so heavily to perform well on the Modanet dataset and keeps more of its general object detection capabilities from the COCO dataset, with its 80 very different categories.

6.1.2 Spotin Dataset

6.1.2.1 Transfer Learning with COCO

When training with the same parameters as used for the results above, but with Spotin's much smaller dataset, the model got 0.0 mAP, as described in Section 5.1.2. This is most likely because the model was overfitted almost instantly when training the head or all layers of the model at once for so many training steps on such a small dataset. To confirm this, the model is first trained on the head layers, and additional layers are added incrementally as the training progresses through the epochs. The layer settings are heads, 3+, 4+, 5+ and all layers.

Footnote 3: How mean average precision, mAP, is calculated is described in Section 3.5.

The epochs were split so that each layer setting got an equal number of epochs. In this case, the first 20 of the 100 epochs are used for training the head layers. The next 20 epochs are used to train the heads and the layers down to and including the third layer, and so on, with each block of 20 epochs training more layers until all layers have been trained.
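The equal split of epochs over the layer settings can be written down as a short schedule. This is a sketch of the arithmetic only; the function name and the tuple layout are assumptions, and the setting names follow Table 6.

```python
def layer_schedule(total_epochs, layer_settings):
    """Split total_epochs evenly over the layer settings.

    Returns a list of (layers, first_epoch, last_epoch), 1-indexed inclusive.
    """
    if not layer_settings or total_epochs % len(layer_settings) != 0:
        raise ValueError("total_epochs must divide evenly over the settings")
    per_stage = total_epochs // len(layer_settings)
    schedule = []
    for i, layers in enumerate(layer_settings):
        first = i * per_stage + 1
        last = (i + 1) * per_stage
        schedule.append((layers, first, last))
    return schedule


STAGES = ["heads", "3+", "4+", "5+", "all"]
schedule = layer_schedule(100, STAGES)
for layers, first, last in schedule:
    print(f"epochs {first:>3} to {last:>3}: train layers {layers}")
```

With 100 epochs and five settings, each setting gets 20 epochs, which matches the split described above.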

On top of training layers incrementally, only T-shirt and pants products were used, as their shapes and colors differ. If the model manages to differentiate these two clothing types, it should be able to adapt to training on the complete dataset later on.

When training on gradually more layers and on only two clothing types, the model manages to get results, in contrast to training on only the heads or all layers, where no result was achieved. On average, the last epochs resulted in ≈ 0.52 mAP.

6.1.2.2 Transfer Learning with Modanet

This result of ≈ 0.52 mAP was achieved with the COCO dataset for transfer learning. Using the best weight file from the Modanet training for transfer learning instead yielded higher accuracy: ≈ 0.57 mAP.

The weight file chosen as the most suitable for TL from the Modanet training was the one with the highest combined accuracy. The combined accuracy was computed for each epoch by multiplying the accuracy of the evaluation on the Modanet dataset by the accuracy of the evaluation on the Spotin dataset. The weight file from the Modanet training that performed best was the 94th. This weight file is used for TL when training on the Spotin dataset henceforth.
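The selection rule can be expressed in a few lines. The sketch below uses made-up scores for four epochs purely to show the calculation; the real per-epoch values are those in the appendices, and the function name is an assumption.

```python
def best_epoch(modanet_map, spotin_map):
    """Return the 1-indexed epoch with the highest product of the two scores."""
    if not modanet_map or len(modanet_map) != len(spotin_map):
        raise ValueError("need one score per epoch for both datasets")
    combined = [m * s for m, s in zip(modanet_map, spotin_map)]
    return max(range(len(combined)), key=combined.__getitem__) + 1


# Hypothetical mAP values per epoch, for illustration only.
modanet_scores = [0.40, 0.55, 0.60, 0.62]
spotin_scores = [0.05, 0.12, 0.14, 0.10]
chosen = best_epoch(modanet_scores, spotin_scores)
```

Multiplying the two scores favors weight files that do reasonably well on both datasets: in the example, the last epoch has the best Modanet score but is not chosen because its Spotin score is lower.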

The figures from Section 5.1.2 allow the different weight files for transfer learning to be compared. Using Modanet for TL (Fig. 19) yielded better results than training with COCO for TL (Fig. 18).

6.1.2.3 Complete Spotin Dataset

These results were accomplished by training on only two clothing types, but the model managed to produce a result. By enabling all clothing types again, training for 1000 epochs instead of 100 and using Modanet for TL, the model managed to score a result, unlike previous attempts in which it scored 0.0 mAP. This time, the model reached on average ≈ 0.44 mAP for the last 10 epochs when training on the complete dataset. The result of the training can be seen in Fig. 20.

6.1.2.4 Overfitting

When the accuracy previously was 0.0 mAP, the model had overfitted. To analyze this behavior and to find the best result, the training was extended. First, 1000 epochs were trained with Modanet as TL. The last weight file from this training was then used for TL for the next 1000 epochs, and finally 500 epochs were trained based on the second 1000 epochs. In total, 2500 epochs were trained.

In Fig. 21, the training seems to reach a maximum after ≈ 1600 epochs. The model manages to score at most ≈ 0.49 mAP during that peak. After that, the model performs much worse, which might be due to overfitting. Even though few training steps are used in each epoch, the training is done over a large number of epochs for a dataset of this size.

6.1.3 Comparing Result to Other’s

The authors of MR-CNN [29] managed to train the model and show that, even given the low accuracy, the use case shows promise. The original authors got 0.32 mAP on the Cityscapes dataset. What makes this dataset interesting is that it is also limited in annotations per category: the categories for trucks, buses and trains have only 200–500 annotated images each. Compared to this, Spotin's dataset, with only 34 annotations per category and an accuracy of 0.49 mAP, shows great promise.

PD.1 With a dataset containing only a few hundred samples, what accuracy of detection is possible to achieve?

With such a small dataset, the results are good. Ideally, each category should contain at least 1000 annotations, while in this case each category contains on average only 34 annotations (Section 2.3.3). Having far fewer and still obtaining ≈ 0.49 mAP is good. It is interesting to see that increasing the number of epochs, in this case from 1000 to 1600, makes it possible to score higher accuracy. Increasing the number too much results in worse accuracy.

6.2 Detection

6.2.1 Defining Users Expectations

Section 2.2 discusses a study evaluating the patience of online users: almost 50% of users are not willing to wait more than five seconds, and 20% are not willing to wait more than three seconds. Therefore, it is important that detections are processed fast.

6.2.2 Timing of Detection

By using an in-memory store for distributing tasks, and with MR-CNN being a fast detector, it is possible to do a detection in just over 1 second on an Nvidia GTX 1070 graphics card with 8 GB of memory. A slower graphics card, the Nvidia GTX 1050 with 2 GB, manages the same task in just under 1.3 seconds. CPUs can be used, but with a Ryzen 1600 with 8 cores activated and 8 GB of memory the detection took over 5 seconds. Processors should only be used for detection as a backup if all nodes with graphics cards are occupied.

The time for detection includes some overhead for downloading the image and for processing the data that is returned to the user. The detection itself takes around 0.5 seconds on an Nvidia GTX 1070. Timings can be found in Section 5.2.1.


In the paper for MR-CNN [29], the detection time was 0.195 seconds on an Nvidia Tesla M40. The authors managed 3 fps (frames per second) with an Nvidia Tesla P100, which corresponds to a detection taking around 0.33 seconds. Both of these GPUs are faster and have more memory than an Nvidia GTX 1070.

PD.3 Is it possible to get detections and predictions within a few seconds?

With the expectations from Section 2.2 as the target and the result of just over one second from Section 5.2.1, the time it takes to do a detection is within the time frame most users are prepared to wait.

6.3 Prediction

Two different weight functions were evaluated in Section 5.3. The uniform function predicted the correct product less than 10% of the time, while the much better weight function, distance, found the correct product 100% of the time. These results were evaluated on images that were used to build the K-NN model. It is also possible to identify products in images that contain clothes other than the product, for example a pair of pants whose images also contain a T-shirt, but this is more hit or miss.

6.3.1 Timing of Prediction

The time-consuming part of the predictions is building the model, as said in Section 4.1.1. This is because a detection has to be done for each image fed into the model. Once the K-NN space has been built, a prediction runs on the CPU and takes just over 0.5 seconds (Section 5.3.1). This was done with 5 cores of a Ryzen 1600 with 8 GB of memory.



Given the low accuracy of prediction for recommended products, a better approach would be to extract patterns and textures, as was done in Image Based Fashion Product Recommendation with Deep Learning [5].

In A deep learning pipeline for product recognition on store shelves [6], the pipeline for detection and prediction with a CNN and K-NN took less than one second. This is comparable to the result in this thesis: 0.5 seconds to run the detection, without the overhead of downloading the image, and about 0.5 seconds for the prediction on the K-NN space.

PD.2 Based on masks from category detection is it possible to identify which product was found?

It is possible to identify products from images that were used to build the K-NN model. For the model to be able to predict more generally, other data would be more suitable as input when building the model. Patterns and textures could be interesting to research further.


With the expectations from Section 2.2 as target and the result of 0.5 seconds from Section 5.3.1, the time it takes to do a prediction is within the time frame most users are prepared to wait.

6.4 Implementation

As the back end exposes a REST API that the front end uses for all communication, there is low coupling between the two components, as said in Section 4. It is therefore possible to hook up an existing front end to the back end without changing the back end. Communication between the back end and the nodes is done via a Redis database hosted by the back end, further contributing to low coupling.


With low coupling, it is easy to hook up an existing platform to the back end and nodes, while still keeping the front end for some tasks, such as annotating.


6.5 Video Possibilities

Currently, it is possible to analyze each frame of a video and run detection on these frames to create an interactable timeline that matches the content of the video. With the current solution, the video has to be pre-processed, as each detection still takes 0.5 seconds (Section 5.2.1).

If true real-time performance is wanted, another solution would be to implement YOLO [32], as it is currently the fastest object detector, but that would also sacrifice some accuracy.

A third solution for real-time processing is to skip all but two video frames each second and animate the difference in between.
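The third option can be sketched as two small helpers: one that picks which frames to run detection on, and one that interpolates a bounding box between two detected frames. This is an illustrative sketch only; the function names, the box layout (x1, y1, x2, y2) and the use of linear interpolation are assumptions, not part of the prototype.

```python
def sampled_frame_indices(total_frames, video_fps, samples_per_second=2):
    """Indices of the frames to run detection on, evenly spaced in time."""
    if total_frames < 0 or video_fps <= 0 or samples_per_second <= 0:
        raise ValueError("frame count, fps and sample rate must be positive")
    indices = []
    k = 0
    while True:
        index = int(k * video_fps / samples_per_second)
        if index >= total_frames:
            break
        if not indices or index != indices[-1]:
            indices.append(index)
        k += 1
    return indices


def interpolate_box(box_a, box_b, t):
    """Linearly interpolate between two (x1, y1, x2, y2) boxes, 0 <= t <= 1."""
    return tuple(a + (b - a) * t for a, b in zip(box_a, box_b))


# Two seconds of 30 fps video: detection runs on four frames only.
frames = sampled_frame_indices(total_frames=60, video_fps=30)
halfway = interpolate_box((0, 0, 10, 10), (10, 0, 20, 10), 0.5)
```

Frames between two sampled indices would then show the interpolated box instead of a freshly detected one, which trades some positional accuracy for far fewer detections.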

PD.5 What are the possibilities to use this implementation on video?

It is currently not possible to input a video into the model, but with software such as FFmpeg [40] it is possible to extract frames from a video and feed them into the model in order to build a timeline of currently visible products and the location of these products.

6.6 Ethical

This thesis analyzes the possibility of finding and identifying clothing products within an image. A natural use of this technology is to increase sales by reaching more customers and recommending products that those customers are likely to buy. Increased sales bring two aspects. The first is that more consumption risks taking a toll on the environment, through more production and shipping of products. The second is that consumption helps build a strong economy.

Intrusion on personal integrity can be a problem when providing a tool that lets users upload their own images to identify products. It is then possible to learn the price of the clothes that people in an image are wearing. The tool could be used in the wrong setting, such as for public humiliation of a public figure or as a tool to harass people.

6.7 Encountered Problems

During training of the model with Spotin's dataset for the first time, the model did not yield any hits when doing detections. This was because the model overfitted much more quickly with the smaller dataset of 269 annotated images, compared to Modanet's more than 55 000 images. This was solved by reducing the training steps for each epoch.

Training in general was very time-consuming, often taking well over a day to complete. As this thesis was conducted over a set amount of time, not all parameters could be analyzed. A better result would therefore probably be possible with the current dataset, given more time for further testing of parameters.

The size of Spotin's dataset limits the model from training to a high accuracy. Transfer learning is one method to mitigate this problem, by letting the model inherit weights from training on another dataset. One way of extending the dataset is to use data augmentation on all images and annotations to increase the size of the dataset many times over.

Paper.js did not perform as expected, as mentioned in Section 4.3.2. Event listeners for clicks on a detection mask were only triggered in the top left quadrant. Therefore, a simpler way of triggering predictions was chosen: displaying buttons on the side for each clothing type that was detected. These buttons can be seen in Fig. 10 (b), while the mask displayed in the same image would be the ideal way of interacting and triggering the prediction.

7 Conclusion

Throughout this thesis, several techniques within computer vision have been discussed.

Different object detectors have been analyzed with respect to their strengths and weaknesses. The object detector chosen is Mask R-CNN, which allows for almost real-time performance while keeping high accuracy. To predict a product based on a mask, k-nearest neighbors was used.

A tool for annotating products in Spotin's catalogue was created, and a total of 269 images were annotated. When training the model on these images, it was possible to create a model with decent results: an accuracy of 0.49 mean average precision was achieved.

Key techniques for achieving this accuracy were transfer learning, reusing the result of previous training on the same or a different dataset, and data augmentation, which helped with the small dataset by extending it many times over. With simple augmentations it was possible to extend the dataset 21 times.


PD.1 With a dataset containing only a few hundred samples, what accuracy of detection is possible to achieve?

By using transfer learning to speed up training, and training with low step parameters to avoid overfitting, it is possible to achieve ≈ 0.49 mAP.

PD.2 Based on masks from category detection is it possible to identify which unique product was found?

It is possible to identify the products that were used to build the model with histogram data from images. A better solution could be to extract patterns and textures from images instead of histograms.


Detection takes just over one second with the complete pipeline, and prediction takes 0.5 seconds.


The back end and front end have low coupling, as the back end hosts a REST API. Therefore, it is easy to hook up the existing back end to an existing front end.

PD.5 What are the possibilities to use this implementation on video?

By extracting video frames and doing detections and predictions on them, it is possible to build a timeline of currently visible products and the location of these products.

8 Future Work

As mentioned in the Scope (Section 1.3), only dataset images with a single person were used. This reduces confusion for some clothing types, for example jackets or shoes, because they often consist of two segments. A way to tackle this issue could be to use a general person detection model to find the bounding boxes of persons within an image and run training or detection on the boxes separately. Some issues can still occur, for example if two persons are standing too close to each other. This could be handled by simply discarding bounding boxes when two boxes overlap. This way, it is guaranteed that the model won't accidentally mix up segments belonging to clothes worn by another person.
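Discarding overlapping person boxes could look like the sketch below. It assumes axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2, and treats any shared area as an overlap; both the box format and the function names are assumptions for illustration.

```python
def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def non_overlapping(boxes):
    """Keep only the boxes that overlap no other box in the list."""
    return [
        box
        for i, box in enumerate(boxes)
        if not any(
            boxes_overlap(box, other)
            for j, other in enumerate(boxes)
            if i != j
        )
    ]


# Hypothetical person boxes: the first two overlap, the third stands alone.
people = [(0, 0, 10, 20), (8, 0, 18, 20), (30, 0, 40, 20)]
kept = non_overlapping(people)
```

A less strict variant could tolerate a small overlap by comparing the intersection area against a threshold, at the cost of occasionally mixing segments from two people.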

Due to the limited dataset, this thesis treats the dataset as a whole. In a final product of the clothing detector, multiple brands will be included in the dataset. An interesting question is whether training on different brands' images would yield a difference in accuracy.

The API exposed by the back end is currently open to everyone. For some routes, such as status, detection and prediction, this is not too bad. But the route for initiating training should be limited to authenticated users only. It would also be useful to limit the number of requests made by a user so that the API does not get abused, for example by limiting detections and predictions to 500 per day.
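A per-user daily limit could be sketched as follows. This is an in-memory illustration with hypothetical names, not part of the existing back end; a real deployment would more likely keep the counters in a shared store so that they survive restarts and are visible to every process.

```python
from collections import defaultdict
from datetime import date


class DailyRateLimiter:
    """Allow at most `limit` requests per user per calendar day."""

    def __init__(self, limit=500):
        self.limit = limit
        self._counts = defaultdict(int)  # (user_id, day) -> requests so far

    def allow(self, user_id, today=None):
        """Record one request and return True if it is within the limit."""
        day = today or date.today()
        key = (user_id, day)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


# Small limit so the behavior is easy to see.
limiter = DailyRateLimiter(limit=2)
day = date(2019, 7, 7)
results = [limiter.allow("user-1", today=day) for _ in range(3)]
```

Counters for past days are never removed in this sketch, so a long-running service would need to prune them.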

The functionality of the existing routes could be extended as well. For example, there is currently no route for checking the status of training in progress.

Caching and adding additional parameters to requests could reduce the time a request takes to be processed. For example, two identical detection requests perform the same detection twice, which is an unnecessary use of resources. Caching would solve this issue. Additionally, there is currently no way of using locally stored images. Using images stored on the server would make requests return faster.
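The caching idea can be sketched as a thin wrapper around the detection call. The class, the key format and `fake_detect` are hypothetical and stand in for the real detection request; the sketch also assumes that the image behind a URL does not change between requests.

```python
import hashlib
import json


class DetectionCache:
    """Cache detection results keyed on the image URL and request parameters."""

    def __init__(self, detect):
        self._detect = detect   # function (image_url, params) -> result
        self._results = {}
        self.misses = 0         # number of times the detector actually ran

    @staticmethod
    def _key(image_url, params):
        payload = json.dumps({"url": image_url, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def detect(self, image_url, params=None):
        params = params or {}
        key = self._key(image_url, params)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._detect(image_url, params)
        return self._results[key]


def fake_detect(image_url, params):
    """Placeholder for the real detection request (assumption)."""
    return {"url": image_url, "detections": ["t-shirt"]}


cache = DetectionCache(fake_detect)
first = cache.detect("https://example.com/a.jpg")
second = cache.detect("https://example.com/a.jpg")
```

The second call returns the stored result without running the detector again. The cache grows without bound here, so an eviction policy would be needed in practice.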

To increase the size of the dataset without annotating more images, data augmentation has been used. The augmentation steps, that is the amounts of rotation, scaling and shearing, were chosen without further research. Therefore, a more in-depth study could be made of how augmentation affects training accuracy: whether more augmentation increases accuracy, and where the limit for augmentation lies.


References

[1] Spotin Demo Page. https://demo.spotin.com. Accessed: 2019-04-15.

[2] Spotin. https://spotin.com. Accessed: 2019-04-15.

[3] Evgeny Smirnov Kuznech, Egor Smirnov, and Karina Ivanova. “Deep Learning for Fast and Accurate Fashion Item Detection”. In: 2016.

[4] Jasper Uijlings et al. “Selective Search for Object Recognition”. In: International Journal of Computer Vision 104 (Sept. 2013), pp. 154–171. doi: 10.1007/s11263-013-0620-5.

[5] Hessel Tuinhof, Clemens Pirker, and Markus Haltmeier. “Image Based Fashion Product Recommendation with Deep Learning”. In: CoRR abs/1805.08694 (2018). arXiv: 1805.08694. url: http://arxiv.org/abs/1805.08694.

[6] Alessio Tonioni, Eugenio Serro, and Luigi di Stefano. “A deep learning pipeline for product recognition on store shelves”. In: CoRR abs/1810.01733 (2018). arXiv: 1810.01733. url: http://arxiv.org/abs/1810.01733.

[7] Joseph Redmon and Ali Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016). arXiv: 1612.08242. url: http://arxiv.org/abs/1612.08242.

[8] Limelight Networks. The State of the User Experience. 2017. url: https://www.limelight.com/resources/white-paper/state-of-user-experience-2017/ (visited on 04/15/2019).

[9] Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context”. In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. url: http://arxiv.org/abs/1405.0312.

[10] Chuanqi Tan et al. “A Survey on Deep Transfer Learning”. In: CoRR abs/1808.01974 (2018). arXiv: 1808.01974. url: http://arxiv.org/abs/1808.01974.

[11] Kota Yamaguchi, M. Hadi Kiapour, and Tamara L. Berg. “Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items”. In: 2013.

[12] Shuai Zheng et al. “ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations”. In: ACM Multimedia. 2018.


[13] B. Mehlig. “Artificial Neural Networks”. In: CoRR abs/1901.05639 (2019). arXiv: 1901.05639. url: http://arxiv.org/abs/1901.05639.

[14] Glosser.ca. Artificial neural network with layer coloring. File: Colored neural network.svg. 28 February 2013, 13:53:39. url: https://en.wikipedia.org/wiki/File:Colored_neural_network.svg.

[15] Aphex34. Input volume connected to a convolutional layer. File: Conv layer.png. 15 December 2015. url: https://en.wikipedia.org/wiki/File:Conv_layer.png.

[16] Jayanth Koushik. “Understanding Convolutional Neural Networks”. In: arXiv e-prints (May 2016). arXiv: 1605.09081 [stat.OT].

[17] Username Aphex34 at Wikipedia. Max pooling image. 2015. url: https://en.wikipedia.org/wiki/File:Max_pooling.png (visited on 05/07/2019).

[18] Shaeke Salman and Xiuwen Liu. “Overfitting Mechanism and Avoidance in Deep Neural Networks”. In: CoRR abs/1901.06566 (2019). arXiv: 1901.06566. url: http://arxiv.org/abs/1901.06566.

[19] Luis Perez and Jason Wang. “The Effectiveness of Data Augmentation in Image Classification using Deep Learning”. In: CoRR abs/1712.04621 (2017). arXiv: 1712.04621. url: http://arxiv.org/abs/1712.04621.

[20] Paul Henderson and Vittorio Ferrari. “End-to-end training of object class detectors for mean average precision”. In: CoRR abs/1607.03476 (2016). arXiv: 1607.03476. url: http://arxiv.org/abs/1607.03476.

[21] Zhong-Qiu Zhao et al. “Object Detection with Deep Learning: A Review”. In: CoRR abs/1807.05511 (2018). arXiv: 1807.05511. url: http://arxiv.org/abs/1807.05511.

[22] H. A. Rowley, S. Baluja, and T. Kanade. “Neural network-based face detection”. In: IEEE Transactions on PAMI (1998).

[23] R. Vaillant, C. Monrocq, and Yann LeCun. “Original approach for the localization of objects in images”. English (US). In: IEE Conference Publication. Publ by IEE, 1993, pp. 26–29.

[24] Steven J. Nowlan and John C. Platt. “A Convolutional Neural Network Hand Tracker”. In: NIPS. 1994.

[25] Pierre Sermanet et al. “Pedestrian detection with unsupervised multi-stage feature learning”. English (US). In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2013, pp. 3626–3633. doi: 10.1109/CVPR.2013.465.

[26] Ross B. Girshick. “Fast R-CNN”. In: CoRR abs/1504.08083 (2015). arXiv: 1504.08083. url: http://arxiv.org/abs/1504.08083.

[27] M. Everingham et al. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. url: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[28] Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: CoRR abs/1506.01497 (2015). arXiv: 1506.01497. url: http://arxiv.org/abs/1506.01497.

[29] Kaiming He et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870. url: http://arxiv.org/abs/1703.06870.

[30] Wei Liu et al. “SSD: Single Shot MultiBox Detector”. In: CoRR abs/1512.02325 (2015). arXiv: 1512.02325. url: http://arxiv.org/abs/1512.02325.

[31] Joseph Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: CoRR abs/1506.02640 (2015). arXiv: 1506.02640. url: http://arxiv.org/abs/1506.02640.

[32] Joseph Redmon and Ali Farhadi. “YOLOv3: An Incremental Improvement”. In: arXiv (2018).

[33] M. Everingham et al. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. url: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[34] N. S. Altman. “An Introduction to Kernel and Nearest Neighbor Nonparametric Regression”. In: Biometrics Unit Technical Reports BU-1065-MA (1991). url: https://hdl.handle.net/1813/31637.

[35] Antti Ajanki (AnAj). Example of k-nearest neighbour classificationnb.28.

[36] Gyan Prakash Tiwary and Abhishek Srivastava. “A Novel Approach to Implement Message Level Security in RESTful Web Services”. In: CoRR abs/1609.06012 (2016). arXiv: 1609.06012. url: http://arxiv.org/abs/1609.06012.

[37] Evan You. Introduction — Vue.js. 2019. url: https://vuejs.org/v2/guide/#What-is-Vue-js (visited on 05/08/2019).

[38] Jurg Lehni and Jonathan Puckey. Paper.js. 2019. url: http://paperjs.org/about/ (visited on 05/08/2019).

[39] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.

[40] FFmpeg. https://ffmpeg.org/. Accessed: 2019-05-14.

Appendices

A Back End API

Table 11: Back end REST API.

Route        /api/categories
Method type  GET
Returns      [ ]Category objects

Route        /api/categories/< category id >
Method type  GET
Returns      Object;

Route        /api/detect
Method type  POST
Parameters   Object; {image url:str, dataset:str}
Returns      Object;??

Route        /api/predict
Method type  POST
Parameters   Object;??
Returns      Object;??

Route        /api/products
Method type  GET
Returns      [ ]Product objects

Route        /api/products/category/< category id >
Method type  GET
Returns      [ ]Product objects

Route        /api/product/< product id >
Method type  GET
Returns      Object;??

Route        /api/product/< product id >/< image idx >/annotations
Method type  POST
Parameters   Object;??
Returns      Object;??

Route        /api/stats
Method type  GET
Returns      Object; {nodes:[ ]Node objects, num nodes:int, idle:int,
             working on train:int, working on detect:int,
             backlog train:int, backlog detect:int}

Route        /api/train
Method type  POST
Parameters   Object; {epochs:int, training steps:int, validation steps:int}
Returns      Object;??
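
As a rough illustration of how a client might call the detection route in Table 11, the following Python sketch builds (but does not send) a POST request. The base URL, the JSON encoding of the body, and the underscore in the key `image_url` are assumptions; the table only lists the parameters as "image url" and "dataset".

```python
import json
from urllib.request import Request

# Hypothetical address of the back end; the thesis does not state one.
BASE_URL = "http://localhost:5000"


def build_detect_request(image_url, dataset, base_url=BASE_URL):
    """Build a POST request for /api/detect without sending it.

    The key name "image_url" and the JSON body are assumptions based on
    the parameter list {image url:str, dataset:str} in Table 11.
    """
    payload = json.dumps({"image_url": image_url, "dataset": dataset})
    return Request(
        url=base_url + "/api/detect",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_detect_request("https://example.com/shirt.jpg", "spotin")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out because the response format is not specified in the table.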

B Translation of Category Ids from Modanet’s to Spotin’s

Table 12: Mapping of category ids from Modanet dataset to similar categories in Spotin dataset

Modanet name  Modanet Id  Spotin Name  Spotin Id
bag           1           bag          52
belt          2           belt         53
boots         3           boots        50
footwear      4           shoes        89
outer         5           jacket       77
dress         6           dress        83
sunglasses    7           sunglasses   5
pants         8           pants        86
top           9           top          80
shorts        10          shorts       82
skirt         11          skirt        87
headwear      12          hat          84
scarf & tie   13          accessory    88
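
The mapping in Table 12 can be written as a lookup table. The sketch below is only an illustration: the id pairs are copied from the table, while the annotation layout (dictionaries with a `category_id` key, as in COCO-style files) and the helper name `remap_annotations` are assumptions.

```python
# Modanet category id -> Spotin category id, copied from Table 12.
MODANET_TO_SPOTIN = {
    1: 52,   # bag -> bag
    2: 53,   # belt -> belt
    3: 50,   # boots -> boots
    4: 89,   # footwear -> shoes
    5: 77,   # outer -> jacket
    6: 83,   # dress -> dress
    7: 5,    # sunglasses -> sunglasses
    8: 86,   # pants -> pants
    9: 80,   # top -> top
    10: 82,  # shorts -> shorts
    11: 87,  # skirt -> skirt
    12: 84,  # headwear -> hat
    13: 88,  # scarf & tie -> accessory
}


def remap_annotations(annotations):
    """Return copies of the annotations with Spotin category ids.

    Annotations whose category has no entry in the mapping are skipped.
    The "category_id" key is an assumed, COCO-style field name.
    """
    remapped = []
    for annotation in annotations:
        spotin_id = MODANET_TO_SPOTIN.get(annotation["category_id"])
        if spotin_id is None:
            continue
        remapped.append({**annotation, "category_id": spotin_id})
    return remapped


print(remap_annotations([{"category_id": 4}, {"category_id": 99}]))
```

Copying each annotation instead of editing it in place keeps the original Modanet data usable for further training runs.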

C Training on Modanet Dataset With “all” Layer, COCO Weight for TL

Epoch  Modanet dataset  Spotin dataset
1      0.2948431826     0.05288461538
2      0.3664384968     0.07692307692
3      0.3474740591     0.08653846154
4      0.4112729004     0.06009615385
5      0.4392452744     0.1057692308
6      0.4573153292     0.07692307692
7      0.4621527842     0.08653846154
8      0.4432413821     0.1634615385
9      0.4712988471     0.09134615385
10     0.5093215874     0.08653846154
11     0.4781688868     0.08653846154
12     0.4662885444     0.0625
13     0.5192792345     0.08653846154
14     0.4625076371     0.09134615385
15     0.4786203713     0.08653846154
16     0.5185451075     0.09134615385
17     0.5278178488     0.08653846154
18     0.4910466346     0.08173076923
19     0.5627732075     0.08653846154
20     0.4894116375     0.08653846154
21     0.5368940859     0.1009615385
22     0.4994839441     0.08653846154
23     0.530944567      0.1153846154
24     0.5209817669     0.1105769231
25     0.5073527235     0.08173076923
26     0.5177741606     0.1009615385
27     0.5289312492     0.1442307692
28     0.5674622324     0.1346153846
29     0.4977995334     0.1394230769
30     0.5363858431     0.09615384615
31     0.5341712515     0.1153846154
32     0.5066598054     0.09615384615
33     0.5344322401     0.1153846154
34     0.5462347443     0.1201923077
35     0.5515399942     0.09615384615
36     0.5806608732     0.1057692308
37     0.5436843784     0.07692307692
38     0.5675866213     0.09134615385
39     0.552314185      0.08653846154
40     0.5526556844     0.1057692308
41     0.5668208257     0.06730769231
42     0.5543727166     0.1153846154
43     0.5867952597     0.09775641026
44     0.5771709086     0.09615384615
45     0.550114475      0.1201923077
46     0.5587748081     0.1009615385
47     0.569027786      0.1346153846
48     0.6171951394     0.07692307692
49     0.5817380248     0.09615384615
50     0.5981536239     0.09615384615
51     0.5710378573     0.125
52     0.595478677      0.108974359
53     0.5901275438     0.1057692308
54     0.5784840578     0.1105769231
55     0.58419262       0.1394230769
56     0.5700303944     0.1666666667
57     0.5931875828     0.1394230769
58     0.6076523054     0.08653846154
59     0.5830051959     0.1009615385
60     0.596393855      0.09134615385
61     0.5594093478     0.1346153846
62     0.584823343      0.09615384615
63     0.5435676187     0.1346153846
64     0.6126495814     0.08653846154
65     0.5770799048     0.1298076923
66     0.5787191001     0.09134615385
67     0.5967033033     0.1298076923
68     0.5852506937     0.09615384615
69     0.5852133005     0.09134615385
70     0.5840167189     0.1394230769
71     0.6106416028     0.1330128205
72     0.6013461606     0.1153846154
73     0.61006449       0.1586538462
74     0.6291273732     0.08653846154
75     0.5861988767     0.1538461538
76     0.6042712219     0.1730769231
77     0.6030575648     0.1185897436
78     0.6127407724     0.08493589744
79     0.572140193      0.1842948718
80     0.6155715875     0.1586538462
81     0.5991399636     0.09134615385
82     0.6037320726     0.08653846154
83     0.6073763799     0.1778846154
84     0.5954834465     0.09134615385
85     0.608366972      0.09134615385
86     0.6069768845     0.08653846154
87     0.6032005571     0.09134615385
88     0.6432509229     0.1586538462
89     0.6279792497     0.09455128205
90     0.6041397452     0.08173076923
91     0.6222323133     0.1298076923
92     0.6201694201     0.1298076923
93     0.6000953967     0.125
94     0.6451693242     0.1057692308
95     0.6353205198     0.1682692308
96     0.5886744562     0.1298076923
97     0.5978830949     0.1009615385
98     0.6283056492     0.1538461538
99     0.6319362194     0.1298076923
100    0.6294478837     0.1233974359

D Training on Modanet Dataset With “heads” Layer, COCO Weight for TL

Epoch  Modanet dataset  Spotin dataset
1      0.2252146342     0.06730769231
2      0.221033658      0.09134615385
3      0.2837549657     0.125
4      0.289149311      0.125
5      0.2859184271     0.09134615385
6      0.3497134531     0.1105769231
7      0.3227852245     0.125
8      0.3600373996     0.125
9      0.3651366055     0.1201923077
10     0.3065804389     0.125
11     0.3033501278     0.125
12     0.3654399487     0.1153846154
13     0.3413614221     0.125
14     0.3534159482     0.125
15     0.3627224569     0.125
16     0.396032135      0.1201923077
17     0.3298401315     0.1298076923
18     0.3939852405     0.125
19     0.4109031664     0.1634615385
20     0.4129225874     0.1394230769
21     0.4397115461     0.125
22     0.40562577       0.125
23     0.4116468868     0.1490384615
24     0.4033165511     0.125
25     0.4055326688     0.1298076923
26     0.4005593784     0.1538461538
27     0.4144860418     0.1346153846
28     0.4763934748     0.125
29     0.4109384609     0.1634615385
30     0.4675158423     0.1442307692
31     0.4341357671     0.1346153846
32     0.3993131941     0.1394230769
33     0.4672119275     0.1298076923
34     0.4419169794     0.1538461538
35     0.4190104243     0.1394230769
36     0.4273580659     0.1346153846
37     0.4724767309     0.1346153846
38     0.4183749849     0.1298076923
39     0.4503155595     0.1346153846
40     0.4557333705     0.1666666667
41     0.4273569204     0.1634615385
42     0.4172123076     0.1298076923
43     0.4537717562     0.125
44     0.4420547775     0.1394230769
45     0.4858314325     0.1298076923
46     0.4305288535     0.1298076923
47     0.4807182985     0.1346153846
48     0.4693610807     0.125
49     0.4582696182     0.1298076923
50     0.4151833483     0.1346153846
51     0.4140576015     0.1298076923
52     0.4846985728     0.1346153846
53     0.4541952901     0.1682692308
54     0.474448649      0.1634615385
55     0.4999752063     0.1346153846
56     0.4626404212     0.1923076923
57     0.4259172837     0.1730769231
58     0.4840422854     0.1394230769
59     0.4758211301     0.1442307692
60     0.4782768691     0.1394230769
61     0.4477392474     0.1586538462
62     0.4747514197     0.1346153846
63     0.4635424359     0.1490384615
64     0.4596741528     0.1442307692
65     0.5036305781     0.1426282051
66     0.4835031351     0.1298076923
67     0.4705881599     0.1394230769
68     0.5271329442     0.1586538462
69     0.4920116075     0.1394230769
70     0.4906685049     0.1378205128
71     0.4677228395     0.125
72     0.5073321206     0.1346153846
73     0.5214743665     0.1346153846
74     0.4887961766     0.1394230769
75     0.4942949337     0.1826923077
76     0.4841412736     0.1346153846
77     0.4920219086     0.1394230769
78     0.4580651016     0.1682692308
79     0.5280948632     0.1682692308
80     0.5155242754     0.1346153846
81     0.4854681847     0.1346153846
82     0.475603259      0.1346153846
83     0.4923863011     0.1346153846
84     0.5115686136     0.1538461538
85     0.4882188716     0.1298076923
86     0.4775227099     0.1442307692
87     0.5189736039     0.1298076923
88     0.4815926499     0.1394230769
89     0.5046718627     0.1346153846
90     0.5454792504     0.1298076923
91     0.4686248532     0.1634615385
92     0.5231349276     0.125
93     0.5214426954     0.1522435897
94     0.5274246197     0.1346153846
95     0.5024374318     0.125
96     0.5044053148     0.1346153846
97     0.4991727785     0.1442307692
98     0.4817468025     0.1394230769
99     0.4706091715     0.1538461538
100    0.5048546316     0.1298076923

E Training on Spotin Dataset With “all” Layers, COCO and Modanet #94 for TL

Epoch  COCO          Modanet94
1      0             0
2      0             0.04807692308
3      0             0
4      0             0.2692307692
5      0             0.04807692308
6      0.1442307692  0.4807692308
7      0             0.4230769231
8      0             0.4807692308
9      0.0625        0.5673076923
10     0             0.5673076923
11     0.2692307692  0.5673076923
12     0.2692307692  0.5673076923
13     0.2115384615  0.5673076923
14     0.375         0.5673076923
15     0.1442307692  0.5673076923
16     0.2692307692  0.5673076923
17     0.375         0.5673076923
18     0.5192307692  0.5673076923
19     0.5192307692  0.5673076923
20     0.375         0.5673076923
21     0.3557692308  0.5673076923
22     0.375         0.5673076923
23     0.2692307692  0.5673076923
24     0.375         0.5673076923
25     0.4615384615  0.5673076923
26     0.4615384615  0.5673076923
27     0.3557692308  0.5673076923
28     0.5192307692  0.5673076923
29     0.5192307692  0.5673076923
30     0.375         0.5673076923
31     0.5192307692  0.5673076923
32     0.5192307692  0.5673076923
33     0.5192307692  0.5673076923
34     0.5192307692  0.5673076923
35     0.5192307692  0.5673076923
36     0.5192307692  0.5673076923
37     0.5192307692  0.5673076923
38     0.5192307692  0.5673076923
39     0.5192307692  0.5673076923
40     0.5192307692  0.5673076923
41     0.5192307692  0.5673076923
42     0.5192307692  0.5673076923
43     0.5192307692  0.5673076923
44     0.5192307692  0.5673076923
45     0.4615384615  0.5673076923
46     0.5192307692  0.5673076923
47     0.5192307692  0.5673076923
48     0.5192307692  0.5673076923
49     0.5192307692  0.5673076923
50     0.5192307692  0.5673076923
51     0.5192307692  0.5673076923
52     0.5192307692  0.5673076923
53     0.5192307692  0.5673076923
54     0.5192307692  0.5673076923
55     0.5192307692  0.5673076923
56     0.5192307692  0.5673076923
57     0.5192307692  0.5673076923
58     0.5192307692  0.5673076923
59     0.5192307692  0.5673076923
60     0.5192307692  0.5673076923
61     0.5192307692  0.5673076923
62     0.5192307692  0.5673076923
63     0.5192307692  0.5673076923
64     0.5192307692  0.5673076923
65     0.5192307692  0.5673076923
66     0.5192307692  0.5673076923
67     0.5192307692  0.5673076923
68     0.5192307692  0.5673076923
69     0.5192307692  0.5673076923
70     0.5192307692  0.5673076923
71     0.5192307692  0.5673076923
72     0.5192307692  0.5673076923
73     0.5192307692  0.5673076923
74     0.5192307692  0.5673076923
75     0.5192307692  0.5673076923
76     0.5192307692  0.5673076923
77     0.5192307692  0.5673076923
78     0.5192307692  0.5673076923
79     0.5192307692  0.5673076923
80     0.5192307692  0.5673076923
81     0.5192307692  0.5673076923
82     0.5192307692  0.5673076923
83     0.5192307692  0.5673076923
84     0.5192307692  0.5673076923
85     0.5192307692  0.5673076923
86     0.5192307692  0.5673076923
87     0.5192307692  0.5673076923
88     0.5192307692  0.5673076923
89     0.5192307692  0.5673076923
90     0.5192307692  0.5673076923
91     0.5192307692  0.5673076923
92     0.5192307692  0.5673076923
93     0.5192307692  0.5673076923
94     0.5192307692  0.5673076923
95     0.5192307692  0.5673076923
96     0.5192307692  0.5673076923
97     0.5192307692  0.5673076923
98     0.5192307692  0.5673076923
99     0.5192307692  0.5673076923
100    0.5192307692  0.5673076923

F Training on Spotin Dataset With ”all” Lay-


Epoch  Modanet94
1      0
14     0
27     0
40     0
53     0.01923076923
66     0
79     0
92     0
105    0.07692307692
118    0.1057692308
131    0.125
144    0.125
157    0.1153846154
170    0.07692307692
183    0.1346153846
196    0.2115384615
209    0.2115384615
222    0.2019230769
235    0.1730769231
248    0.375
261    0.2211538462
274    0.2836538462
287    0.2980769231
300    0.3413461538
313    0.3365384615
326    0.25
339    0.3365384615
352    0.3846153846
365    0.4326923077
378    0.3653846154
391    0.4326923077
404    0.3365384615
417    0.3557692308
430    0.4134615385
443    0.3365384615
456    0.4038461538
469    0.4134615385
482    0.4038461538
495    0.4711538462
508    0.4519230769
521    0.4038461538
534    0.3653846154
547    0.4423076923
560    0.3028846154
573    0.4567307692
586    0.3269230769
599    0.4711538462
612    0.3894230769
625    0.3413461538
638    0.4038461538
651    0.4567307692
664    0.4519230769
677    0.4230769231
690    0.4615384615
703    0.4663461538
716    0.4663461538
729    0.4375
742    0.4663461538
755    0.4903846154
768    0.4230769231
781    0.4519230769
794    0.4759615385
807    0.4423076923
820    0.4567307692
833    0.4519230769
846    0.4423076923
859    0.4375
872    0.4663461538
885    0.4326923077
898    0.4711538462
911    0.4471153846
924    0.4567307692
938    0.4519230769
950    0.4567307692
963    0.4423076923
976    0.4326923077
989    0.4471153846
1000   0.4423076923

G Training on Spotin Dataset With ”all” Lay-


Epoch  Modanet94
1      0
11     0
21     0
31     0
41     0
51     0
61     0
71     0
81     0.009615384615
91     0.05769230769
101    0.09615384615
111    0
121    0.07692307692
131    0.1057692308
141    0.1153846154
151    0.2211538462
161    0.1057692308
171    0.1634615385
181    0.1442307692
191    0.08653846154
201    0.2019230769
211    0.2019230769
221    0.3028846154
231    0.2403846154
241    0.3269230769
251    0.2692307692
261    0.2980769231
271    0.3269230769
281    0.3701923077
291    0.3269230769
301    0.2980769231
311    0.3653846154
321    0.3557692308
331    0.3173076923
341    0.3076923077
351    0.3942307692
361    0.3653846154
371    0.3076923077
381    0.4038461538
391    0.375
401    0.4182692308
411    0.4134615385
421    0.375
431    0.375
441    0.4038461538
451    0.4134615385
461    0.3653846154
471    0.3942307692
481    0.3461538462
491    0.3942307692
501    0.4038461538
511    0.4278846154
521    0.4134615385
531    0.375
541    0.3365384615
551    0.3846153846
561    0.3846153846
571    0.3846153846
581    0.3557692308
591    0.375
601    0.4038461538
611    0.3653846154
621    0.3557692308
631    0.3365384615
641    0.3365384615
651    0.3365384615
661    0.3942307692
671    0.3653846154
681    0.3461538462
691    0.3461538462
701    0.3461538462
711    0.3365384615
721    0.375
731    0.375
741    0.3653846154
751    0.3653846154
761    0.4182692308
771    0.3557692308
781    0.3557692308
791    0.3798076923
801    0.3990384615
811    0.4134615385
821    0.3798076923
831    0.3894230769
841    0.3894230769
851    0.3461538462
861    0.3557692308
871    0.3557692308
881    0.3461538462
891    0.3557692308
901    0.3365384615
911    0.3365384615
921    0.3365384615
931    0.3557692308
941    0.375
951    0.3365384615
961    0.3365384615
971    0.3557692308
981    0.3557692308
991    0.3365384615
1001   0.3701923077
1011   0.4086538462
1021   0.375
1031   0.3365384615
1041   0.375
1051   0.4230769231
1061   0.3269230769
1071   0.375
1081   0.375
1091   0.375
1101   0.3461538462
1111   0.4038461538
1121   0.3990384615
1131   0.375
1141   0.375
1151   0.3798076923
1161   0.375
1171   0.3653846154
1181   0.375
1191   0.3461538462
1201   0.3557692308
1211   0.3173076923
1221   0.3317307692
1231   0.375
1241   0.3509615385
1251   0.375
1261   0.375
1271   0.3461538462
1281   0.4086538462
1291   0.3461538462
1301   0.3509615385
1311   0.375
1321   0.4086538462
1331   0.4086538462
1341   0.3461538462
1351   0.4038461538
1361   0.4326923077
1371   0.4086538462
1381   0.4038461538
1391   0.3269230769
1401   0.4038461538
1411   0.4326923077
1421   0.375
1431   0.4038461538
1441   0.4326923077
1451   0.4038461538
1461   0.4038461538
1471   0.4326923077
1481   0.3846153846
1491   0.375
1501   0.4038461538
1511   0.4038461538
1521   0.4134615385
1531   0.4038461538
1541   0.4903846154
1551   0.4038461538
1561   0.4038461538
1571   0.4807692308
1581   0.4903846154
1591   0.4038461538
1601   0.4807692308
1611   0.4326923077
1621   0.4134615385
1631   0.4134615385
1641   0.4038461538
1651   0.4038461538
1661   0.4903846154
1671   0.4423076923
1681   0.4903846154
1691   0.4423076923
1701   0.4038461538
1711   0.4038461538
1721   0.4038461538
1731   0.4903846154
1741   0.4423076923
1751   0.4134615385
1761   0.4423076923
1771   0.4038461538
1781   0.4038461538
1791   0.4038461538
1801   0.4038461538
1811   0.4038461538
1821   0.375
1831   0.4038461538
1841   0.4134615385
1851   0.4038461538
1861   0.4326923077
1871   0.4038461538
1881   0.4038461538
1891   0.4038461538
1901   0.4326923077
1911   0.4038461538
1921   0.4038461538
1931   0.4038461538
1941   0.4038461538
1951   0.4038461538
1961   0.4038461538
1971   0.4038461538
1981   0.4326923077
1991   0.4038461538
2001   0.4230769231
2011   0.4134615385
2021   0.2980769231
2031   0.2788461538
2041   0.3028846154
2051   0.3846153846
2061   0.2980769231
2071   0.4134615385
2081   0.375
2091   0.375
2101   0.375
2111   0.3365384615
2121   0.3365384615
2131   0.3365384615
2141   0.375
2151   0.2980769231
2161   0.375
2171   0.2980769231
2181   0.3028846154
2191   0.375
2201   0.4134615385
2211   0.2980769231
2221   0.2307692308
2231   0.2980769231
2241   0.375
2251   0.3461538462
2261   0.2980769231
2271   0.2980769231
2281   0.2980769231
2291   0.3365384615
2301   0.2980769231
2311   0.3173076923
2321   0.2692307692
2331   0.3076923077
2341   0.3413461538
2351   0.3173076923
2361   0.25
2371   0.375
2381   0.3269230769
2391   0.375
2401   0.375
2411   0.375
2421   0.375
2431   0.2980769231
2441   0.375
2451   0.375
2461   0.375
2471   0.375
2481   0.375
2491   0.3461538462
2500   0.375

XIX

H Timing Performance of Detection on Different Hardware

Node hardware type  Computation unit    Memory  Time (s)
GPU                 Nvidia GTX 1070     8GB     0.95
GPU                 Nvidia GTX 1070     8GB     1.06
GPU                 Nvidia GTX 1070     8GB     1.12
GPU                 Nvidia GTX 1070     8GB     1.20
GPU                 Nvidia GTX 1070     8GB     0.94

GPU                 Nvidia GTX 1050     2GB     1.41
GPU                 Nvidia GTX 1050     2GB     1.27
GPU                 Nvidia GTX 1050     2GB     1.26
GPU                 Nvidia GTX 1050     2GB     1.15
GPU                 Nvidia GTX 1050     2GB     1.30

CPU                 Ryzen 1600 8 cores  8GB     5.09
CPU                 Ryzen 1600 8 cores  8GB     5.12
CPU                 Ryzen 1600 8 cores  8GB     5.05
CPU                 Ryzen 1600 8 cores  8GB     5.42
CPU                 Ryzen 1600 8 cores  8GB     5.03
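
To compare the three hardware configurations at a glance, the five runs per device can be averaged. The following Python sketch does this with the timings listed above; the dictionary keys and the function name `mean_detection_times` are illustrative names, not identifiers from the thesis.

```python
from statistics import mean

# Detection times in seconds, copied from the table in this appendix.
DETECTION_TIMES = {
    "Nvidia GTX 1070": [0.95, 1.06, 1.12, 1.20, 0.94],
    "Nvidia GTX 1050": [1.41, 1.27, 1.26, 1.15, 1.30],
    "Ryzen 1600": [5.09, 5.12, 5.05, 5.42, 5.03],
}


def mean_detection_times(timings):
    """Return the mean detection time per device, rounded to 3 decimals."""
    return {device: round(mean(runs), 3) for device, runs in timings.items()}


for device, seconds in mean_detection_times(DETECTION_TIMES).items():
    print(f"{device}: {seconds:.3f} s")
```

With these numbers the GTX 1070 averages slightly more than one second per detection, in line with the figure quoted in the abstract, while the CPU runs take roughly five times as long.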



I Prediction With Uniform as Weight Scale

idx     label  p predict  p hit  p in hits  h predict  h hit  h in hits
1       1266   1065       0      1          1009       0      1
2       1444   1074       0      2          1074       0      2
3       974    1161       0      3          927        0      3
4       1200   1161       0      4          1009       0      4
5       1376   577        0      5          917        0      5
6       924    517        0      6          927        0      6
7       1010   1161       0      7          1009       0      7
8       1342   517        0      8          1161       0      8
9       1208   1161       0      9          1459       0      9
10      1356   1266       0      10         1455       0      10
11      991    517        0      11         517        0      11
12      1074   1074       1      12         1373       0      12
13      967    1373       1      13         1373       0      13
14      1162   1009       1      14         1459       0      14
15      1306   1455       1      15         1009       0      15
16      516    1455       1      16         516        1      16
17      1065   577        1      17         1050       1      17
18      1120   1065       1      18         1050       1      18
19      1156   1373       1      19         1074       1      19
20      1161   1161       2      20         1161       2      20
21      1200   577        2      21         1161       2      21
22      1457   1161       2      22         1161       2      22
23      1104   517        2      23         1459       2      23
24      1266   577        2      24         1161       2      24
25      1342   517        2      25         1161       2      25
26      1466   1074       2      26         1009       2      26
27      1167   1161       2      27         927        2      27
28      1345   517        2      28         892        2      28
29      1492   1036       2      29         1050       2      29
30      1454   1161       2      30         1009       2      30
31      1104   517        2      31         927        2      31
32      1474   1065       2      32         1009       2      32
33      634    577        2      33         1050       2      33
34      1300   1161       2      34         1009       2      34
35      1023   517        2      35         1009       2      35
36      1393   517        2      36         892        2      36
37      634    517        2      37         634        3      37
38      1373   1074       2      38         1459       3      38
39      1065   1455       2      39         1009       3      39
40      1345   517        2      40         892        3      40
41      577    577        3      41         1050       3      41
42      634    517        3      42         517        3      42
43      1513   1161       3      43         892        3      43
44      1455   1161       3      44         1161       3      44
45      938    517        3      45         517        3      45
46      898    517        3      46         892        3      46
47      1348   890        3      47         967        3      47
48      1165   1455       3      48         927        3      48
49      1373   1373       4      49         1074       3      49
50      1509   1074       4      50         1074       3      50
51      1050   577        4      51         1161       3      51
52      1076   1065       4      52         1009       3      52
53      1213   1455       4      53         517        3      53
54      1465   1161       4      54         1161       3      54
55      1065   1161       4      55         1455       3      55
56      938    517        4      56         517        3      56
57      967    1161       4      57         892        3      57
58      1513   1161       4      58         917        3      58
59      517    517        5      59         517        4      59
60      1076   1161       5      60         1009       4      60
61      1009   1065       5      61         1009       5      61
62      890    890        6      62         890        6      62
63      517    517        7      63         517        7      63
64      1072   1036       7      64         1050       7      64
65      1160   1065       7      65         1373       7      65
66      1478   517        7      66         634        7      66
67      1161   1161       8      67         1161       8      67
68      859    517        8      68         917        8      68
69      865    1161       8      69         917        8      69
70      1171   634        8      70         517        8      70
71      1455   517        8      71         927        8      71
72      1455   517        8      72         1161       8      72
73      890    517        8      73         634        8      73
74      929    1161       8      74         1459       8      74
75      1006   1161       8      75         917        8      75
76      891    517        8      76         927        8      76
77      1053   1161       8      77         634        8      77
78      1443   517        8      78         892        8      78
79      1168   1455       8      79         1009       8      79
80      1213   1455       8      80         1161       8      80
81      1009   577        8      81         892        8      81
82      1036   1074       8      82         1373       8      82
83      901    517        8      83         1161       8      83
84      1461   1161       8      84         1455       8      84
85      1078   1161       8      85         634        8      85
86      519    517        8      86         634        8      86
87      1170   634        8      87         517        8      87
88      1173   1161       8      88         1074       8      88
89      925    518        8      89         892        8      89
90      1508   1455       8      90         917        8      90
91      1141   1074       8      91         1373       8      91
92      1009   577        8      92         1009       9      92
93      1074   1266       8      93         1455       9      93
94      934    517        8      94         1068       9      94
95      1104   517        8      95         917        9      95
96      751    517        8      96         634        9      96
97      1373   1074       8      97         1036       9      97
98      967    1373       8      98         1373       9      98
99      1194   518        8      99         927        9      99
100     1049   1065       8      100        1009       9      100
result  -      -          0.08   1          -          0.09   1

J Prediction With Distance as Weight Scale


idx     label  p predict  p hit  p in hits  h predict  h hit  h in hits
1       634    634        1      1          634        1      1
2       516    516        2      2          516        2      2
3       1457   1457       3      3          1457       3      3
4       1461   1461       4      4          1461       4      4
5       1460   1460       5      5          1460       5      5
6       1208   1208       6      6          1208       6      6
7       1050   1050       7      7          1050       7      7
8       634    634        8      8          634        8      8
9       917    917        9      9          917        9      9
10      1065   1065       10     10         1065       10     10
11      898    898        11     11         898        11     11
12      1050   1050       12     12         1050       12     12
13      1478   1478       13     13         1478       13     13
14      828    828        14     14         828        14     14
15      1459   1459       15     15         1459       15     15
16      1307   1307       16     16         1307       16     16
17      1266   1266       17     17         1266       17     17
18      1072   1072       18     18         1072       18     18
19      1312   1312       19     19         1312       19     19
20      1058   1058       20     20         1058       20     20
21      1458   1458       21     21         1458       21     21
22      1068   1068       22     22         1068       22     22
23      977    977        23     23         977        23     23
24      1065   1065       24     24         1065       24     24
25      1266   1266       25     25         1266       25     25
26      577    577        26     26         577        26     26
27      1485   1485       27     27         1485       27     27
28      1187   1187       28     28         1187       28     28
29      1345   1345       29     29         1345       29     29
30      1141   1141       30     30         1141       30     30
31      1187   1187       31     31         1187       31     31
32      1304   1304       32     32         1304       32     32
33      1266   1266       33     33         1266       33     33
34      1507   1507       34     34         1507       34     34
35      1023   1023       35     35         1023       35     35
36      925    925        36     36         925        36     36
37      634    634        37     37         634        37     37
38      517    517        38     38         517        38     38
39      1356   1356       39     39         1356       39     39
40      925    925        40     40         925        40     40
41      1390   1390       41     41         1390       41     41
42      1065   1065       42     42         1065       42     42
43      1104   1104       43     43         1104       43     43
44      1200   1200       44     44         1200       44     44
45      1484   1484       45     45         1484       45     45
46      1513   1513       46     46         1513       46     46
47      1208   1208       47     47         1208       47     47
48      1031   1031       48     48         1031       48     48
49      517    517        49     49         517        49     49
50      1444   1444       50     50         1444       50     50
51      1160   1160       51     51         1160       51     51
52      1065   1065       52     52         1065       52     52
53      1200   1200       53     53         1200       53     53
54      1036   1036       54     54         1036       54     54
55      1193   1193       55     55         1193       55     55
56      634    634        56     56         634        56     56
57      1465   1465       57     57         1465       57     57
58      1105   1105       58     58         1105       58     58
59      517    517        59     59         517        59     59
60      1065   1065       60     60         1065       60     60
61      1456   1456       61     61         1456       61     61
62      926    926        62     62         926        62     62
63      1373   1373       63     63         1373       63     63
64      517    517        64     64         517        64     64
65      577    577        65     65         577        65     65
66      1478   1478       66     66         1478       66     66
67      1161   1161       67     67         1161       67     67
68      1000   1000       68     68         1000       68     68
69      1009   1009       69     69         1009       69     69
70      1049   1049       70     70         1049       70     70
71      1208   1208       71     71         1208       71     71
72      934    934        72     72         934        72     72
73      927    927        73     73         927        73     73
74      1213   1213       74     74         1213       74     74
75      516    516        75     75         516        75     75
76      898    898        76     76         898        76     76
77      1508   1508       77     77         1508       77     77
78      1393   1393       78     78         1393       78     78
79      1167   1167       79     79         1167       79     79
80      929    929        80     80         929        80     80
81      1009   1009       81     81         1009       81     81
82      1161   1161       82     82         1161       82     82
83      1210   1210       83     83         1210       83     83
84      828    828        84     84         828        84     84
85      1078   1078       85     85         1078       85     85
86      1171   1171       86     86         1171       86     86
87      1480   1480       87     87         1480       87     87
88      1120   1120       88     88         1120       88     88
89      1497   1497       89     89         1497       89     89
90      1508   1508       90     90         1508       90     90
91      1454   1454       91     91         1454       91     91
92      1299   1299       92     92         1299       92     92
93      522    522        93     93         522        93     93
94      892    892        94     94         892        94     94
95      1212   1212       95     95         1212       95     95
96      751    751        96     96         751        96     96
97      1120   1120       97     97         1120       97     97
98      772    772        98     98         772        98     98
99      967    967        99     99         967        99     99
100     1171   1171       100    100        1171       100    100
result  -      -          1      1          -          1      1
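
The gap between the uniform and the distance weight scale can be illustrated with a small k-nearest-neighbors sketch in plain Python. It is not the implementation used in the thesis; it assumes the common convention that, with distance weighting, each neighbor votes with weight 1/distance and a neighbor at distance zero decides the vote alone. Under that assumption, a query that is identical to a sample used to build the model always gets its own label back, which would be consistent with the perfect score in this appendix.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=3, weights="uniform"):
    """Classify `query` by a vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs. With
    weights="uniform" every neighbor has one vote; with
    weights="distance" a neighbor votes with weight 1/distance, and
    neighbors at distance zero override all others (assumed convention).
    """
    neighbors = sorted(
        ((math.dist(features, query), label) for features, label in train),
        key=lambda pair: pair[0],
    )[:k]

    votes = defaultdict(float)
    if weights == "distance":
        exact = [label for distance, label in neighbors if distance == 0.0]
        if exact:
            for label in exact:
                votes[label] += 1.0
        else:
            for distance, label in neighbors:
                votes[label] += 1.0 / distance
    else:
        for _, label in neighbors:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Toy data: one sample of product "A" close to two samples of product "B".
train = [((0.0, 0.0), "A"), ((1.0, 0.0), "B"), ((1.1, 0.0), "B")]
query = (0.0, 0.0)  # identical to the stored sample of "A"
print(knn_predict(train, query, k=3, weights="uniform"))
print(knn_predict(train, query, k=3, weights="distance"))
```

In the toy data the uniform vote is won by the two "B" samples even though the query coincides with the "A" sample, while the distance-weighted vote returns "A".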

K Timing Performance of Prediction on Different Hardware

Index  Time (s)
1      0.538
2      0.536
3      0.500
4      0.535
5      0.515
