
Report

Explainable Machine Learning

Dynamic Routing Between Capsules

Author: Michael Dorkenwald

Supervisor: Dr. Ullrich Köthe

28 June 2018


Contents

1 Introduction
2 Motivation
3 CapsuleNet
   3.1 Capsules
   3.2 Architecture
   3.3 Routing Algorithm
   3.4 Example How the Routing Algorithm Works
   3.5 Margin Loss
   3.6 Reconstruction
4 Results
   4.1 MNIST Data Set
      4.1.1 Instantiation Parameters
      4.1.2 Robustness
   4.2 MultiMNIST
   4.3 CIFAR10
5 Discussion


1 Introduction

Artificial neural networks are currently one of the most active topics in machine learning. In the last few years there have been many developments that enhanced neural networks and made them more accessible. However, most of them were incremental, such as adding more layers or improving existing layer types like Batch Normalization, rather than introducing a new kind of architecture. This makes the paper discussed here particularly interesting: on the one hand it is published by Geoffrey Hinton and his team, and on the other hand it introduces a completely new architecture based on capsules. Hinton is one of the founders of deep learning and the inventor of several models and algorithms that are widely used today. He has pursued the idea of capsules for a long time and has finally published a working network that achieves state-of-the-art performance on MNIST.

2 Motivation

A Convolutional Neural Network (ConvNet or CNN) is a specific type of deep neural network in which a model learns to perform classification tasks directly from images or videos. CNNs are useful for finding patterns in images in order to recognize objects, faces, and scenes. They have been successful at identifying faces, objects, and traffic signs, which is an important component for powering vision in robots and self-driving cars.
During training, the different layers of a Convolutional Neural Network learn different types of features. The convolutional layers close to the input learn low-level features such as edges or color gradients. The convolutional layers close to the fully connected layers (the output) learn high-level features, which are combinations of low-level features. The dense layers combine these high-level features to produce a classification. This is shown in the figure below.

Figure 1: Different types of features of a CNN [2]

For example, if you want to classify a ship or a horse, the first layers detect small curves and edges. The second layer might detect straight lines or smaller shapes, such as the mast of a ship or the curvature of a tail. Higher layers start to detect more complex shapes like the entire tail or the ship's hull. The final layers try to see a more holistic picture, like the entire ship or the entire horse.
In Convolutional Neural Networks, Max-Pooling is used to reduce the spatial size of the feature maps. This keeps the computation time reasonable and helps to avoid overfitting, because it decreases the number of neurons in the network. On the other hand, it only extracts the dominant feature of a specific field of view.


Thereby we lose the spatial information about where this value came from, and because of this we only obtain a small amount of invariance to changes of the viewpoint. This is illustrated in the following figure.

Figure 2: Max-Pooling [3]

Max-Pooling is a crutch that makes CNNs work really well. When we look at the performance of Convolutional Neural Networks, we see that the error on nearly all datasets has decreased over the last years, reaching superhuman performance on several benchmarks. But the loss of spatial relations is a thorn in Geoffrey Hinton's side. He says: 'The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster' [4]. Hinton is looking for equivariance, meaning that changes in viewpoint lead to corresponding changes in neural activities. Therefore we need a different architecture that does not use Max-Pooling layers.
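To make this loss of spatial information concrete, here is a small, self-contained NumPy illustration (my own example, not taken from the report's sources): two feature maps whose strong activation sits at different positions inside the same 2x2 pooling window produce exactly the same pooled output, so the position information is discarded.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 4x4 feature map."""
    return x.reshape(2, 2, 2, 2).max(axis=(1, 3))

# Two feature maps: the activation sits at (0, 0) in one and at (1, 1) in the other,
# i.e. at different positions inside the same pooling window.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# Both inputs give the identical pooled output, so the exact location is lost.
print(max_pool_2x2(a))
print(max_pool_2x2(b))
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```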

3 CapsuleNet

3.1 Capsules

A capsule is a small group of neurons that learns to detect a particular object (e.g., a rectangle) within a given region of the image. There are many different ways to implement the basic idea of capsules. The authors decided that the length of a capsule's output vector should represent the probability that a specific pattern or object is present in the input image, while the individual dimensions encode the instantiation parameters of that entity, which are learned by the network. In the results section, the meaning of the individual dimensions of the output vector is shown. They used a 'squashing' function to ensure that the length of the output represents a probability and to introduce a non-linearity into the network. The difference between this non-linearity and one used in a standard neural network, such as a ReLU or sigmoid, is that it is applied to the whole capsule vector instead of to each neuron separately. The 'squashing' function is defined as:

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|}

where v_j is the output vector of capsule j and s_j is its total input.
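A minimal NumPy sketch of this squashing function; the small `eps` constant is my own addition to avoid a division by zero for an all-zero input vector.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrinks short vectors towards length 0 and long vectors towards length 1,
    while keeping their direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# A long input ends up with length close to 1, a short one close to 0.
print(np.linalg.norm(squash(np.array([10.0, 0.0]))))  # ~0.99
print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # ~0.0099
```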

3.2 Architecture

Like regular neural networks, the Capsule Network consists of multiple layers, but it is shallow: it consists only of two convolutional layers and one fully connected layer. A simple CapsNet architecture is shown in the figure below.
The first layer is a standard convolutional layer with 256 channels and a 9x9 window, followed by a ReLU activation. The job of the first layer is to detect basic features in the 2D input image. It transforms the pixel values of the input image into local feature activities, which are the input to the primary capsules.
The second layer is a convolutional capsule layer with 32 primary capsules, whose job is to combine the basic features detected in the first layer. Each capsule applies eight convolutional kernels to the input volume and creates a 6x6x8 output tensor. Since there are 32 such capsules, the output volume has the shape 6x6x8x32.


The third and last layer consists of the 10 digit capsules, one for each digit of the MNIST dataset. Each digit capsule takes the 6x6x32 = 1152 8D vectors as input and outputs a 16D vector. Between these two layers the routing algorithm is applied, and that is where all the magic of this paper happens.

Figure 3: Architecture of the Capsule Network [1]
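To make the tensor shapes of these layers concrete, here is a minimal PyTorch sketch of the two convolutional stages. The stride of 2 in the primary capsule layer is taken from the original paper [1] (it is not stated above), and the squashing and routing steps are omitted.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=1)              # 1x28x28 -> 256x20x20
primary_caps = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)  # 256x20x20 -> 256x6x6

x = torch.randn(1, 1, 28, 28)                    # one MNIST-sized input image
feat = torch.relu(conv1(x))                      # (1, 256, 20, 20)
u = primary_caps(feat)                           # (1, 256, 6, 6), i.e. 32 capsule types x 8D
u = u.view(1, 32, 8, 6, 6).permute(0, 1, 3, 4, 2).reshape(1, -1, 8)
print(u.shape)                                   # torch.Size([1, 1152, 8]) = 6*6*32 capsules of 8D
```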

3.3 Routing Algorithm

Between the two capsule layers the routing algorithm is applied. The algorithm follows a specific procedure, which can be seen in figure 4. This procedure can also be applied to Capsule Networks with more capsule layers.
For all capsule layers except the first one, a weight matrix W_ij is applied to the output of the previous capsule layer in order to encode important relationships between lower-level features (e.g. mouth, nose) and higher-level features (e.g. face):

\hat{u}_{j|i} = W_{ij} u_i

û_j|i is the 'prediction vector', an estimate of what the output of capsule j should be, and u_i is the output of capsule i in the previous layer. With the help of the prediction vectors we can compute the total input to a capsule as a weighted sum over all these vectors:

s_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad v_j = \mathrm{squash}(s_j)

where the c_ij are the coupling coefficients and v_j is the ultimate output. At first glance, this calculation looks similar to the way a neuron in a fully connected layer weights its inputs before adding them up. In the neuron case these weights are updated during backpropagation, but in the case of capsules they are determined by the dynamic routing process. This can be interpreted as follows: the dynamic routing algorithm determines where each capsule's output goes. The coupling coefficients are normalized, which means that between capsule i and all the capsules in the layer above they sum to 1. Furthermore, they are determined by a 'routing softmax' whose initial logits b_ij are the log prior probabilities that capsule i should be coupled to capsule j:

c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}

The b_ij can be learned at the same time as all the other weights. They depend on the location and type of the two capsules, but not on the current image. The initial coupling coefficients are then iteratively updated depending on the agreement, which is defined as:

a_{ij} = v_j \cdot \hat{u}_{j|i}


The agreement score therefore takes into account both the likelihood and the feature properties, instead of just the likelihood as in ordinary neurons. Also, b_ij remains low if the activation u_i of capsule i is low, since the length of û_j|i is proportional to u_i; for example, b_ij should remain low between the mouth capsule and the face capsule if the mouth capsule is not activated. The agreement can simply be added to the logits, because at the beginning of each iteration the b_ij are normalized with the 'routing softmax'. On the MNIST dataset, the authors showed that 1 to 3 iterations are sufficient. Dynamic routing is not a complete replacement of backpropagation, because the transformation matrices W_ij are still trained with it. The c_ij quantify the connection between a capsule and its parent capsules, and they are important but short-lived: the logits b_ij are re-initialized to 0 for every data point before the routing calculation. To compute a capsule output, during training or testing, the dynamic routing calculation always has to be redone.

Figure 4: Routing Algorithm [1]
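The following NumPy sketch reimplements this routing-by-agreement loop for a single example (it is my own rendering of the procedure described above, not the authors' code); `u_hat[i, j]` holds the prediction vector û_j|i of lower capsule i for upper capsule j.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Routing by agreement between one pair of capsule layers.
    u_hat: prediction vectors, shape (num_lower, num_upper, dim_upper).
    Returns the upper-layer outputs v, shape (num_upper, dim_upper)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))         # logits, re-initialized for every input
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                   # coupling coefficients, sum to 1 over j
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum over the lower capsules
        v = squash(s)
        b = b + (u_hat * v[None]).sum(axis=-1)   # agreement a_ij = v_j . u_hat_j|i
    return v

# Toy shapes as in the MNIST architecture: 1152 primary capsules -> 10 digit capsules (16D).
u_hat = 0.01 * np.random.randn(1152, 10, 16)
print(dynamic_routing(u_hat).shape)              # (10, 16)
```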

3.4 Example How the Routing Algorithm Works

In this example we focus on a rectangle capsule and a triangle capsule, from which we can build either a boat or a house. Furthermore, we reduce the number of instantiation parameters to one, which represents the rotation. In the lowest window of the figure below you can see the input image, which consists of a boat (left) and a house (right). By passing the input image into the primary capsule layer, the corresponding capsules get activated. The length of each vector corresponds to the probability that a rectangle or a triangle is present in the image, and the rotation of the vector corresponds to the rotation of the rectangle or triangle.

Figure 5: Two-layer capsule network [6]

For now, we focus only on the boat as the input image and look at how the CapsNet processes it. The rectangle and the triangle could each be part of a house or a boat, so we have to take their poses into account.


Because of the rotation of the rectangle, the predicted house or boat would have to be slightly rotated, as shown on the left side of the figure below. The same holds for the triangle: because of its rotation, the predicted house would stand almost upside down. As a result, the triangle and the rectangle strongly agree on the prediction of a boat but disagree on the prediction of a house.

Figure 6: Predicting the presence and pose of objects from the presence and pose of object parts [6]

Since it is very likely that the object is a boat, it makes sense to send the outputs of the primary capsules more to the boat capsule and less to the house capsule, so that the boat capsule receives a more useful input signal. This is realized by updating the coupling coefficients. Because of the strong agreement, the coupling to the boat capsule becomes large and the coupling to the house capsule very small. Within two to three iterations the house capsule (top window of figure 5) shrinks to a tiny vector, while the boat capsule grows large, corresponding to a high probability that a boat is present in the input image.

3.5 Margin Loss

The length of the instantiation vector corresponds to the probability that a specific entity is present in the input image. We would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit is present in the image. To allow for images with multiple labels (more than one digit), a separate margin loss L_k is used for each digit capsule k:
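Reproduced from the original paper [1]:

L_k = T_k \, \max(0,\; m^+ - \|v_k\|)^2 \;+\; \lambda \,(1 - T_k)\, \max(0,\; \|v_k\| - m^-)^2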

where T_k = 1 if a digit of class k is present, and m^+ = 0.9, m^- = 0.1, and λ = 0.5. The loss function is very similar to the SVM hinge loss. During training, one loss value is calculated for each of the 10 digit capsules according to the formula above, and the 10 values are summed to obtain the final loss. This pushes the capsule of the correct class towards a probability greater than 0.9 and the remaining capsules towards probabilities smaller than 0.1.
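A minimal NumPy sketch of this computation (my own, not the authors' implementation):

```python
import numpy as np

def margin_loss(v_lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss summed over the 10 digit capsules.
    v_lengths: lengths ||v_k|| of the digit capsules, shape (10,).
    labels:    T_k in {0, 1}, shape (10,)."""
    present = labels * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return np.sum(present + absent)

# Toy example: the capsule for digit 3 is long, all other capsules are short.
lengths = np.full(10, 0.05); lengths[3] = 0.95
labels = np.zeros(10); labels[3] = 1.0
print(margin_loss(lengths, labels))  # ~0.0, the prediction matches the label
```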

3.6 Reconstruction

The image is reconstructed from the correct digit capsule (a 16D vector), while the other digit capsules are masked out. For this, a decoder is used, like a typical one from an autoencoder. The decoder acts as a regularizer: it takes the output of the correct DigitCap as input and learns to recreate a 28 by 28 pixel image, with the loss being the Euclidean distance between the reconstructed image and the input image. As a result, the decoder forces the capsules to learn features that are useful for reconstructing the original image. The reconstruction loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss.


Figure 7: Decoder Network [1]
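A minimal PyTorch sketch of this decoder. The hidden layer sizes (512 and 1024) are taken from the original paper [1] and are not stated above; following the text, the decoder here receives only the 16D vector of the selected digit capsule (some implementations instead feed all 10 capsules with the others zeroed out).

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(16, 512), nn.ReLU(),
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 28 * 28), nn.Sigmoid(),          # pixel intensities in [0, 1]
)

digit_capsule = torch.randn(1, 16)                   # activity vector of the correct DigitCap
reconstruction = decoder(digit_capsule).view(1, 28, 28)
target = torch.rand(1, 28, 28)                       # placeholder for the input image
recon_loss = 0.0005 * ((reconstruction - target) ** 2).sum()  # scaled-down reconstruction loss
```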

4 Results

The capsule network was evaluated on different data sets and achieved state-of-the-art performance on the MNIST and MultiMNIST data sets. The test errors for these experiments are shown in the figure below.

Figure 8: CapsNet classification test accuracy [1]

4.1 MNIST Data Set

First, the model was evaluated on MNIST, a dataset consisting of images with one digit. The only data augmentation was shifting the images by up to 2 pixels in each direction with zero padding. The dataset consists of 60k training examples and 10k test examples. The model was compared only to models that used the same type of data augmentation. The figure above (figure 8) shows the importance of the routing and the reconstruction regularizer. Without the reconstruction part, the capsule net with 3 routing iterations would not be better than the version with one iteration.
As a baseline model, they used a CNN (convolutional neural network) with 3 convolutional layers of 256, 256, and 128 channels with 5x5 kernels and stride 1. The last convolutional layer is followed by two fully connected layers of size 328 and 192. The last fully connected layer is connected with dropout to a 10-class softmax layer trained with cross-entropy loss. The baseline is also trained on 2-pixel-shifted MNIST with the Adam optimizer. The capsule network achieved a test error of 0.25% with only 3 layers; similar results are otherwise only achieved by deeper networks.
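For comparison, a sketch of the baseline CNN described above. Padding and pooling details are not given in the text, so this assumes 'valid' convolutions without pooling; the flattened size follows from that assumption.

```python
import torch
import torch.nn as nn

baseline = nn.Sequential(
    nn.Conv2d(1, 256, 5, stride=1), nn.ReLU(),    # 28x28 -> 24x24
    nn.Conv2d(256, 256, 5, stride=1), nn.ReLU(),  # 24x24 -> 20x20
    nn.Conv2d(256, 128, 5, stride=1), nn.ReLU(),  # 20x20 -> 16x16
    nn.Flatten(),
    nn.Linear(128 * 16 * 16, 328), nn.ReLU(),
    nn.Linear(328, 192), nn.ReLU(),
    nn.Dropout(),
    nn.Linear(192, 10),                           # logits for the 10-class softmax
)

logits = baseline(torch.randn(1, 1, 28, 28))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3]))   # cross-entropy loss as in the text
```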


4.1.1 Instantiation Parameters

The authors investigated what each individual dimension of a digit capsule represents. They fed a perturbed version of the activity vector into the decoder and looked at how this influenced the reconstructed image; this is shown in the figure below. One dimension of the digit capsule always represents the width of the digit. Other dimensions can represent localization, scale, and thickness. These are easier to interpret than the layers of a standard convolutional neural network.

Figure 9: Dimension perturbations [1]
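A small PyTorch sketch of this perturbation experiment. The decoder has the same layout as the sketch in section 3.6 (untrained here), and the perturbation range of [-0.25, 0.25] in steps of 0.05 is taken from the original paper [1], not from the text above.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(                          # same layout as the decoder sketch above
    nn.Linear(16, 512), nn.ReLU(),
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 28 * 28), nn.Sigmoid(),
)

v = torch.randn(16)                               # activity vector of the chosen digit capsule
dim = 6                                           # arbitrary dimension to perturb
reconstructions = []
for delta in torch.arange(-0.25, 0.2501, 0.05):
    v_tweaked = v.clone()
    v_tweaked[dim] += delta
    reconstructions.append(decoder(v_tweaked.unsqueeze(0)).view(28, 28))
# With a trained model, this sequence of images reveals what the chosen dimension
# controls, e.g. the width, scale or thickness of the reconstructed digit.
```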

4.1.2 Robustness

The robustness of the model was compared to that of a traditional convolutional neural network. For this, small affine transformations were applied to the MNIST images and the impact on the results was investigated. The capsule network achieved 79% accuracy, while a traditional CNN with the same number of parameters achieved only 66%. Both models were trained on the normal MNIST dataset and only evaluated on the affinely transformed images.

4.2 MultiMNIST

First of all, the MultiMNIST dataset had to be created. For this, every image from the MNIST dataset was overlaid with digits from a different class. Each digit is shifted by up to 4 pixels in each direction, resulting in a 36 x 36 image. For each digit in the MNIST dataset, 1K MultiMNIST examples are generated, so the training set consists of 60M images and the test set of 10M. Example images can be seen in the first row of the figure below.

Figure 10: MultiMNIST [1]


For this experiment, a capsule network with 3 routing iterations was used. The two reconstructed digits are overlaid in green and red in the second row. L(l1, l2) are the labels of the input image and R(r1, r2) are the labels used for the reconstruction. The two rightmost columns show two images with a wrong classification (R), reconstructed once from the input label and once from the predicted label. The other columns show correct classifications and demonstrate that the model accounts for all the pixels while being able to assign one pixel to two digits in extremely difficult scenarios. The two most active digit capsules are treated as the classification produced by the capsule network. For the reconstruction, one digit is picked at a time and the activity vector of the chosen digit capsule is used to reconstruct the image of that digit.
The 3-layer capsule network achieved better performance than the baseline model. The baseline model consists of 2 convolutional layers (each followed by a max-pooling layer) and 2 fully connected layers for classification. The baseline has 25M parameters, the capsule network only 11M.
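A NumPy sketch of the dataset generation described at the beginning of this section. The random shift of up to 4 pixels per digit matches the text; merging the two shifted digits with a pixel-wise maximum is my own choice, since the exact blending is not stated.

```python
import numpy as np

def multi_mnist_example(digit_a, digit_b, rng, max_shift=4):
    """Overlay two 28x28 digits of different classes on a 36x36 canvas."""
    canvases = np.zeros((2, 36, 36), dtype=np.float32)
    for k, digit in enumerate((digit_a, digit_b)):
        dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)  # shift of up to 4 px per direction
        canvases[k, dy:dy + 28, dx:dx + 28] = digit
    return canvases.max(axis=0)                              # pixel-wise merge of the two digits

rng = np.random.default_rng(0)
a = rng.random((28, 28)).astype(np.float32)   # placeholders for two MNIST digits
b = rng.random((28, 28)).astype(np.float32)
print(multi_mnist_example(a, b, rng).shape)   # (36, 36)
```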

4.3 CIFAR10

CapsNet was also evaluated on the CIFAR10 dataset and achieved a test error of 10.6% with an ensemble of 7 models, each trained with 3 routing iterations. The architecture is similar to the one for the MNIST dataset; only the number of primary capsules was increased because the input images have 3 channels. The authors note that standard CNNs had about the same error rate when they were first applied to this dataset.

5 Discussion

The authors of the paper have presented a routing algorithm that provides a way for capsules to interact with each other. This kind of network could perhaps one day replace convolutional neural networks. CNNs have become dominant in computer vision tasks such as object detection, but there are signs that they may be replaced by other networks. One sign is, for instance, their difficulty in generalizing to novel viewpoints. If the viewpoint changes, the neural activities in a CapsNet vary correspondingly, rather than eliminating the viewpoint variation from the neural activity as in a CNN. On MNIST, CapsNet has achieved state-of-the-art performance, but it has not yet been tested on large data sets such as ImageNet. The authors have also shown that their model produces better results on MultiMNIST than a regular convolutional neural network. The reconstructions illustrate that CapsNet is able to segment the image into the two original digits, which is very promising for later use in object detection. Furthermore, CapsNet has shown better robustness to affine transformations than a regular convolutional neural network with the same number of parameters.
Overall, the activation vectors are easier to interpret, as can be seen in figure 9, where the dimensions represent, for instance, thickness, scale, or rotation. A problem with this routing algorithm is the time it needs for training, because of the inner loop. However, research on capsule networks is at an early stage, and there are good reasons to believe that it is a better approach than current networks, but it will take a lot of effort to outperform a highly developed network.


References

[1] Paper Dynamic Routing Between Capsules: https://arxiv.org/abs/1710.09829

[2] CNN Features Image: http://www.iro.umontreal.ca/~bengioy/talks/DL-Tutorial-NIPS2015.pdf

[3] Max Pooling Image: https://cs231n.github.io/convolutional-networks/#pool

[4] Quote Geoffrey Hinton: https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/clyj4jv/

[5] AlexNet image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_segmentation.html

[6] Routing Example: https://www.oreilly.com/ideas/introducing-capsule-networks
