
Documentation of Research on Deep Learning

Ideas, Results, and Analysis

This documentation is written by Zhengbo Li and revised by Professor John E. Hopcroft. It includes part of our ideas, results, and analysis related to our research on deep learning.

Nonlinear eigenvector:

Neural networks are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain). In contrast to the complex structures in biological neural networks, people simplify the structures of neural networks to connected layers of “neurons”, where each layer can only forward information to the next layer. The connections have numeric weights that can be tuned, allowing neural networks to learn mappings from inputs to outputs [1]. When the input data and the output data are the same, the network is called an autoencoder. An autoencoder often has fewer hidden gates in the middle than input or output gates, so it is forced to learn a compact representation of the original input data. Generally speaking, neural networks fall into two categories, depending on the type of hidden gates they use. When the hidden gates perform a linear mapping, the networks are called linear networks. When the hidden gates perform a nonlinear mapping, such as the sigmoid function or the ReLU function, the networks are called nonlinear networks. Similarly, there are linear autoencoders and nonlinear autoencoders.

Linear autoencoders have been well studied [2]. Here are some important conclusions [3]. Let X be the matrix where each column is an input pattern. For a linear autoencoder with a single hidden layer of k gates and a sum-square error function, the optimum weight vectors must be equivalent to the top k eigenvectors of XX^T. The optimum is the only local minimum, which means it is the global minimum. Thus an optimization algorithm such as classic gradient descent is guaranteed to find that unique global minimum. Although linear networks are well understood analytically, few people use them in practice because they can only fit linear functions and thus cannot learn complex functions. What's more, a linear network with many layers can be collapsed into one layer by multiplying its weight matrices in the proper order.
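As a small numerical illustration of the statement above (a sketch in plain NumPy, not the code used in our experiments), one can compute the top k eigenvectors of XX^T directly and use them to encode and decode X; a converged linear autoencoder with k hidden gates reaches the same sum-square error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 200))   # columns are input patterns, as in the statement above
k = 3

# top k eigenvectors of X X^T (np.linalg.eigh returns eigenvalues in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(X @ X.T)
U_k = eigenvectors[:, -k:]

# encode with U_k^T and decode with U_k; this projection attains the optimum error
X_hat = U_k @ (U_k.T @ X)
print("optimal sum-square error for k =", k, ":", np.sum((X_hat - X) ** 2))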

Nonlinear neural networks have been used extensively in practice. People have made remarkable progress in recognizing pictures and sound with nonlinear networks; some recent results even beat human beings at classifying objects in pictures. Despite the huge success of nonlinear neural networks, there are few analytical results explaining how they work. In this paper, we develop a concept called the nonlinear eigenvector to explain how a nonlinear autoencoder works.

Nonlinear networks can fit complex functions, making the networks more powerful but harder to understand. Take an autoencoder as an example. Generally speaking, an autoencoder can be viewed as an encoding function combined with a decoding function. When trained from initial weights W1, an autoencoder learns to use a function h1 to encode inputs and its inverse h1^{-1} to recover them, but it may learn different functions h2 and h2^{-1} when started from a different set of initial weights W2. Thus random initial weights give different encode-decode functions, complicating the understanding of autoencoders. What's more, the weights lack properties that can be easily understood, which further complicates analysis. Thus it would be much better if we could get weights that

• are independent of random initial weights, and

• have easily understood properties.

Out of this consideration we developed the concept of the nonlinear eigenvector. The concept is inspired by the classic eigenvector as follows. Training a linear autoencoder with k hidden gates gives weights that are equivalent to the top k eigenvectors, but not the top k eigenvectors themselves. However, training a linear autoencoder gate by gate is guaranteed to give the top k eigenvectors, because eigenvectors admit greedy algorithms. Thus it is interesting to train nonlinear autoencoders gate by gate and see what happens. By analogy with the classic eigenvector, the i-th weight vector trained in this way is called the i-th nonlinear eigenvector.
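The exact gate-by-gate training procedure is not spelled out above; one plausible reading, sketched below in Python (biases omitted, with SciPy used only as a convenient optimizer), is to train gate i while keeping the previously trained gates frozen and re-fitting the decoder each round. The i-th column of the returned matrix is then a candidate for the i-th nonlinear eigenvector.

import numpy as np
from scipy.optimize import minimize

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_gate_by_gate(X, k, rng):
    """X: (n, d) matrix with one input pattern per row; returns (d, k) encoder weights."""
    n, d = X.shape
    W_enc = np.zeros((d, 0))
    for i in range(k):
        def error(theta, frozen=W_enc, n_gates=i + 1):
            w_new = theta[:d].reshape(d, 1)          # weights of the newly added gate
            W = np.hstack([frozen, w_new])           # frozen gates plus the new gate
            W_dec = theta[d:].reshape(n_gates, d)    # decoder is re-fitted each round
            return np.sum((sigmoid(X @ W) @ W_dec - X) ** 2)
        theta0 = rng.normal(scale=0.1, size=d + (i + 1) * d)
        result = minimize(error, theta0, method="L-BFGS-B")
        W_enc = np.hstack([W_enc, result.x[:d].reshape(d, 1)])
    return W_enc

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 9)).astype(float)   # toy binary patterns, 3 by 3 "images"
print(train_gate_by_gate(X, 2, rng).shape)           # (9, 2)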

The nonlinear eigenvectors defined above depend on the specific activation function used to train an autoencoder. For example, the activation function g(x) = x gives exactly the classic eigenvectors. The nonlinear eigenvectors trained from an activation function g are called g-eigenvectors to highlight the activation function. In practice, people often use the sigmoid function or the ReLU function as the activation function. Since these two functions behave differently in our experiments, we discuss them separately in the following.

Sigmoid Case:

Autoencoders with single hidden layer:

Based on the following two experiments, we tentatively conclude that a single-hidden-layer autoencoder with k hidden gates learns the top i sigmoid-eigenvectors, for some 1 ≤ i ≤ k, after training converges.

Experiment 1

In this experiment, we train an autoencoder with three hidden gates. Starting from random weights, training converges to only three error values. These three values correspond to the errors of the top one, top two, and top three sigmoid-eigenvectors. Figure 1 shows the experimental result.

Figure 1: error distribution for 3-gate autoencoders

Experiment 2

In this experiment we test whether an autoencoder has learned a vector v by forcing the weight vectors to be perpendicular to v. If the error increases a lot under the constraint, the autoencoder has learned v; if the error barely changes, it has not, because “being perpendicular to v” is regarded as “being prevented from learning v”. Figure 2 shows the result for a 3-gate autoencoder that lies on level 2 in Figure 1, when the weight vectors are forced to be perpendicular to the i-th sigmoid-eigenvector, i = 1, 2, 3, 4.


Figure 2: nonlinear eigenvectors as perpendicularity constraints

The results in Figure 2 show that for an autoencoder on level 2, the error increases a lot when the weight vectors are forced to be perpendicular to the first or second sigmoid-eigenvector, but does not change when they are forced to be perpendicular to the third or fourth sigmoid-eigenvector. This means an autoencoder on level 2 has learned the top two sigmoid-eigenvectors.

We also did similar experiments for autoencoders on level 1 and level 3. An autoencoder on level i learns the top i sigmoid-eigenvectors, i = 1, 2, 3. Based on these two experiments we tentatively draw the above conclusion.
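How the perpendicularity constraint in Experiment 2 is enforced is not described above; one simple way to implement it (an assumption, not necessarily what was actually done) is to project every encoder weight vector onto the orthogonal complement of v after each update, as in the following sketch.

import numpy as np

def project_out(W_enc, v):
    """Remove from each column of W_enc its component along v, so every weight vector is perpendicular to v."""
    v = v / np.linalg.norm(v)
    return W_enc - np.outer(v, v @ W_enc)

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(100, 3))   # 3 hidden gates, 100-dimensional inputs (10 by 10 grid)
v = rng.normal(size=100)            # stand-in for, e.g., the first sigmoid-eigenvector
W_enc = project_out(W_enc, v)
print(np.abs(v @ W_enc).max())      # close to zero: the constraint holds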

Nonlinear eigenvalue

Nonlinear eigenvectors have been defined above; it would also be useful to define nonlinear eigenvalues properly. Let E_i be the error when a single-hidden-layer autoencoder with i hidden gates converges to its global minimum, i = 1, 2, 3, .... For simplicity we define E_0 as the sum of squares of the elements of the input matrix, which is equivalent to defining “the output of a single-hidden-layer autoencoder with zero hidden gates is always zero”.

Let e_i = E_{i-1} - E_i, i = 1, 2, 3, .... Then e_i is the decrease in error from adding the i-th hidden gate. In the linear case, suppose the eigenvalues are λ1 ≥ λ2 ≥ λ3 ≥ .... There is a theorem [2] saying

\[ \frac{e_1}{\lambda_1} = \frac{e_2}{\lambda_2} = \frac{e_3}{\lambda_3} = \cdots \]

The above equation inspires us to tentatively define e_i as the i-th nonlinear eigenvalue in the nonlinear case. Figure 3 shows the top 10 nonlinear and linear eigenvalues.


Figure 3: linear and nonlinear eigenvalues

Roughly speaking, the nonlinear and linear eigenvalues have similar distributions: both have turning points after the first, third, and fifth eigenvalues. Several of the top nonlinear eigenvalues are larger than their corresponding linear eigenvalues, suggesting that nonlinear autoencoders perform better than linear autoencoders when the number of gates is limited.

However, the second nonlinear eigenvalue is smaller than the third one, which we find strange.
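For reference, the nonlinear eigenvalues plotted in Figure 3 are simply differences of converged errors, e_i = E_{i-1} - E_i. A minimal sketch (the E_i values below are placeholders, not measured results):

import numpy as np

X = np.random.default_rng(0).random((100, 10))   # stand-in input matrix
E = [np.sum(X ** 2)]        # E_0: "an autoencoder with zero hidden gates outputs zero"
E += [90.0, 55.0, 40.0]     # placeholder converged errors E_1, E_2, E_3
e = [E[i - 1] - E[i] for i in range(1, len(E))]
print(e)                    # the (tentative) nonlinear eigenvalues e_1, e_2, e_3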

Autoencoders with two hidden layers:

In practice people usually use more than one hidden layer to recognize pictures or speech, because it is generally believed that more hidden layers are better than one. This subsection mainly deals with why two hidden layers are better than one. All of the results in this subsection are based on autoencoders with one or two hidden layers, using the sigmoid activation function and rectangles as the dataset.

Before answering why two layers are better, it helps to see by how much two layers beat one layer. Figure 4 shows results for autoencoders trained on all possible rectangles on a 10 by 10 grid. The x-axis is the number of gates in each layer, and the y-axis is the error of one layer divided by the error of two layers.


Figure 4: error comparison between one hidden layer and two hidden layers

Figure 4 shows that when there are only a few gates, say 3, two layers perform almost the same as one layer, but when there are more gates, say 25, two layers are better. John proposed a plausible explanation, which I paraphrase as follows.

Let f be the encoder function and g be the decoder function. As the number of hidden gates increases, f has longer outputs, so one reasonably expects f to become more complex. We then need a more complex g to decode f, which makes the second layer necessary.

(Before continuing with this part, I would like to study autoencoders trained on rectangles on a 2 by 2 grid.)

Mathematical Properties of single hidden layer nonlinear autoencoders

In this section we run experiments on single-hidden-layer autoencoders with one, two, or three hidden gates. The activation function is sigmoid and the dataset is all rectangles on the 10 by 10 grid. We also use a small regularization coefficient λ so that the gates can learn what they want; the λ term is necessary to make sure the weights converge even when there are more weights than samples.
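The λ above is read here as a standard L2 weight-decay term added to the sum-square reconstruction error; this is an assumption about the exact form of the penalty, sketched below.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularized_error(X, W_enc, W_dec, lam):
    """Sum-square reconstruction error plus lam times the squared weights."""
    reconstruction = np.sum((sigmoid(X @ W_enc) @ W_dec - X) ** 2)
    weight_decay = lam * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
    return reconstruction + weight_decay

With a small lam (say 1e-4), the penalty barely changes the reconstruction term but keeps the weights from drifting when there are more weights than samples.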

Figure 5 shows the error distribution for one, two, and three gates.


Figure 5: error distribution for one, two, and three gates

Following are some interesting properties:

• For all of the autoencoders in Figure 5, the weights are orthogonal. That is, if wi and wj are the weight vectors for gate i and gate j in the same autoencoder, the inner product of wi and wj is zero, provided i ≠ j.

• For the autoencoders that lie on Line i and have i hidden gates, their weight vectors are the same when we allow permutation between gates and different signs.

• If an autoencoder has n hidden gates and lies on Line m, it has n − m zero weight vectors.

• For any two autoencoders on the same line, if we remove their zero weight vectors, their remaining weight vectors are the same when we allow permutation between gates and different signs.

Experimentally, when we say two values are the same, we mean they agree in at least six digits. They are so close to each other that we tend to believe they are mathematically identical.
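The properties above can be checked numerically with short routines like the following sketch (the tolerance is my own choice; “the same” is taken as agreement to roughly six digits, as stated).

import numpy as np
from itertools import permutations

def off_diagonal_inner_products(W):
    """W: (d, k) encoder weights, one column per gate; returns the off-diagonal Gram entries."""
    G = W.T @ W
    return G[~np.eye(W.shape[1], dtype=bool)]        # all near zero if the weights are orthogonal

def same_up_to_permutation_and_sign(W1, W2, tol=1e-6):
    """True if the columns of W1 and W2 agree after permuting gates and flipping signs."""
    k = W1.shape[1]
    for perm in permutations(range(k)):
        permuted = W2[:, list(perm)]
        if all(np.allclose(W1[:, j], permuted[:, j], atol=tol) or
               np.allclose(W1[:, j], -permuted[:, j], atol=tol)
               for j in range(k)):
            return True
    return False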

Orthogonal Weights

We observed that training single-hidden-layer autoencoders with the sigmoid activation function leads to approximately orthogonal weights. We have done experiments with 2, 3, and 5 gates on rectangles, handwritten digits, a shape set, and CIFAR-10. All of the experiments exhibit the trend of becoming orthogonal as training proceeds. Thus we are confident in the conclusion, even though these are only experimental results. What's more, we do not see any obvious similarities among these four datasets, so we believe this is a general conclusion that holds independent of the dataset.

Figure 6 takes three gates trained on rectangles as an example.


Figure 6: Rectangles trained on three gates

We also noticed that decreases in error coincide with one or more pairs of gates becoming orthogonal. Figure 7 takes five gates trained on handwritten digits as an example.

Figure 7: Handwritten digits trained on five gates

It would be great if we could prove some of the properties we have discovered. After all, there are few theoretical results in this area.

ReLU case: many local minima

We run experiments on single-hidden-layer autoencoders with three hidden gates and the ReLU activation function. The dataset is all possible rectangles on the 10 by 10 grid. We found that training from different initial weights gives different local minima and different errors. Figure 8 shows 30 repeated experiments; for ease of comparison, the errors are sorted.


Figure 8: error distribution for the ReLU case

The nonlinear case does not admit a greedy algorithm

It is well known that calculating principal components admits greedy algorithms. This implies that when training a linear autoencoder, we get equivalent weights whether the gates are trained one by one or together. However, this does not hold for nonlinear autoencoders. We did experiments on rectangles, using single-hidden-layer autoencoders with the sigmoid activation function; in this case, training gate by gate does not give a local minimum.

What happens as weights become orthogonal

Figure 9 shows a single-hidden-layer autoencoder with four sigmoid gates, trained on rectangles. We care about how the inner products between weights change as training proceeds.


Figure 9: inner product vs training iteration

Following are some interesting properties:

• At the beginning and end of training, pairs of weights are almost orthogonal.

• Pairs of weights become orthogonal very quickly as soon as they start to become orthogonal.

• As one pair of weights becomes orthogonal, other weights need to adjust themselves to stay orthogonal. That is why when the red and pink lines drop, the blue line increases, and why when the green line drops, the yellow and black lines increase.

Why sigmoid gives orthogonal weights but ReLU does not

Experiments show that sigmoid can converge to orthogonal weights but ReLU cannot, even if the range of ReLU is constrained to [0, 1]. To be specific, we regard the output of ReLU as 1 whenever it is greater than 1, and in this subsection ReLU means this constrained version. There are two possible properties that could explain the difference between sigmoid and ReLU.

• Property 1: ReLU does not have a derivative at the points (0, 0) and (1, 1).

• Property 2: The derivative of ReLU(x) is exactly zero when x < 0 or x > 1.

To see which property causes the difference, we construct the following two functions:

\[ f_1(x) = \begin{cases} 0 & x < -\pi/2 \\ \dfrac{1}{2} + \dfrac{\sin(x)}{2} & -\pi/2 \le x \le \pi/2 \\ 1 & x > \pi/2 \end{cases} \]

\[ f_2(x) = \frac{1}{2}\,\mathrm{sigmoid}(x) + \frac{1}{2}\,\mathrm{ReLU}(x) \]

Thus f1 has property 2 but not property 1, while f2 has property 1 but not property 2.
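Written out in Python directly from the definitions above (vectorised with np.where and np.clip as an implementation choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu_clipped(x):
    # the constrained ReLU used in this subsection: outputs above 1 are set to 1
    return np.clip(x, 0.0, 1.0)

def f1(x):
    # differentiable everywhere, but its derivative is exactly zero outside [-pi/2, pi/2] (property 2)
    return np.where(x < -np.pi / 2, 0.0,
           np.where(x > np.pi / 2, 1.0, 0.5 + 0.5 * np.sin(x)))

def f2(x):
    # non-differentiable at the ReLU kinks (property 1), but its derivative never vanishes
    return 0.5 * sigmoid(x) + 0.5 * relu_clipped(x)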

The autoencoder used here has a single hidden layer with three gates, trained on rectangles. Figure 10 shows that f1 generates non-orthogonal weights but f2 generates orthogonal weights. Thus it is property 2 that stops ReLU from getting orthogonal weights.


Figure 10: inner product of weights for f1 (left) and f2 (right)

A simplified case:

Here we consider a simplified case where there is only one sigmoid hidden gate, with no bias units and no sigmoid on the output gates.

Following are useful notations. All vectors are column vectors.

• z is the activation vector for the single hidden gate.

• u is the weight vector connecting the hidden layer and output layer, and ui is the weight for the ith output gate.

• X is the input matrix. It is an n by d matrix, where n is the number of input patterns and d is the dimension of one input pattern.

• J is the error. Ji is the error related to the ith output gate.

• w is the weight vector connecting input layer and hidden layer.

• σ is the sigmoid function.

Since we are using sum square error, we have

\[ J_i = (z u_i - X_i)^T (z u_i - X_i) = u_i z^T z\, u_i - X_i^T z\, u_i - z^T X_i\, u_i + C_i \tag{1} \]

where C_i = X_i^T X_i is just a constant that does not depend on the weights.

Since z^T z, X_i^T z, and z^T X_i are just scalars, we have the following partial derivative based on equation (1):

\[ \frac{\partial J_i}{\partial u_i} = 2 u_i z^T z - 2 X_i^T z \tag{2} \]

Setting equation (2) to zero, the optimum u_i for a given z is

\[ u_i = \frac{X_i^T z}{z^T z} \tag{3} \]

Plugging equation (3) into equation (1), the optimum J_i for a given z is

\[ J_i = \left( z \frac{X_i^T z}{z^T z} - X_i \right)^{T} \left( z \frac{X_i^T z}{z^T z} - X_i \right) = -\frac{(X_i^T z)^2}{z^T z} + C_i \tag{4} \]

Thus what we want to minimize is

\[ J = \sum_{i=1}^{d} J_i = \sum_{i=1}^{d} \left( -\frac{(X_i^T z)^2}{z^T z} + C_i \right) = -\frac{z^T X X^T z}{z^T z} + C \tag{5} \]


where C = C_1 + · · · + C_d is just a constant that does not depend on the weights.

Minimizing J is equivalent to maximizing

\[ \frac{z^T X X^T z}{z^T z} \tag{6} \]

under the constraint that there exists a weight vector w such that

\[ z = \sigma(Xw) \tag{7} \]

Here is one way to interpret what we want to maximize in equation (6). If we view the columns of X as d points living in an n-dimensional space and we could choose any z, the optimum z would be the most important component of these points; equivalently, the optimum z would be the first singular vector of X. But since z has to satisfy the constraint in equation (7), more work is needed to figure out what the optimum z is. Still, once we have the optimum z, we immediately get the optimum w and u.
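A quick numerical check of the algebra above (a sketch with random data; the dimensions are arbitrary): for a given w, form z = σ(Xw), take the optimal output weights u from equation (3), and confirm that the resulting error matches equation (5).

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 8                          # n input patterns of dimension d
X = rng.random((n, d))
w = rng.normal(size=d)                # an arbitrary input-to-hidden weight vector

z = 1.0 / (1.0 + np.exp(-(X @ w)))    # hidden activation vector, equation (7)
u = (X.T @ z) / (z @ z)               # optimal output weights, equation (3) for every i
J_direct = np.sum((np.outer(z, u) - X) ** 2)                 # error from the definition of J
J_formula = -(z @ X @ X.T @ z) / (z @ z) + np.sum(X ** 2)    # right-hand side of equation (5)
print(np.isclose(J_direct, J_formula))                       # True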

Multiple hidden layers

When there are multiple hidden layers in a network, a hidden layer only receives information from the layer before it. Thus the (i+1)-th hidden layer cannot have more information than the i-th hidden layer. However, as we increase the number of hidden layers, we do observe that the error decreases (Figure 11). Our explanation is that, compared with earlier layers, later layers have a better representation of the original input. However, the specific meaning of “better” is quite vague, so I tentatively define how good a representation is as follows:

Suppose a hidden layer has m hidden gates, so each input pattern corresponds to an activation vector of length m. Since each activation value is the output of the sigmoid function, each coordinate of the activation vector lies in (0, 1); thus each input pattern corresponds to a point in the m-dimensional unit cube. In my experiments with rectangles, each input pattern is distinct, so we hope the patterns correspond to different points in the m-dimensional unit cube, otherwise the later layers cannot tell them apart. Motivated by this intuition, I calculate the minimum Euclidean distance among all pairs of points in the unit cube, and I take a larger minimum distance to mean a better representation.
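The measure just defined can be computed as follows (a sketch; Z stands for the matrix of hidden activations, one row per input pattern).

import numpy as np

def minimum_pairwise_distance(Z):
    """Z: (n_patterns, m) hidden activations, each row a point in the m-dimensional unit cube."""
    diffs = Z[:, None, :] - Z[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    dists[np.diag_indices_from(dists)] = np.inf    # ignore the distance of a point to itself
    return dists.min()

Z = np.random.default_rng(0).random((20, 3))       # e.g. 20 patterns, 3 hidden gates
print(minimum_pairwise_distance(Z))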

Based on this definition, I did the following three experiments.


Figure 11: error vs number of hidden layers

Experiment 1: minimum distance vs index of hidden layer

This experiment uses three hidden gates in each layer, with two, three, and four hidden layers. The dataset is all possible rectangles on the 3 by 3 grid. Figure 12 shows that within the same autoencoder, later layers have larger minimum distances than earlier layers. This matches our expectation, because we hope the representations of later layers are better than those of earlier layers.


Figure 12: minimum distance vs index of hidden layer

Experiment 2: minimum distance vs number of iteration

This experiment uses three hidden layers, where the first and third hidden layers have 10 hidden gates and the second hidden layer has 2 hidden gates. I use this structure because I want the representation of the second layer to be as good as possible, as long as there is no overfitting. The dataset is all possible rectangles on the 3 by 3 grid. Figure 13 shows that the minimum distance increases as training proceeds. This matches our expectation, because we hope the representation improves as training proceeds.


Figure 13: minimum distance vs number of iteration

Experiment 3: minimum distance vs error

This experiment uses the same network structure and dataset as Experiment 2. I trained the network from different initial weights, obtaining different minimum distances and corresponding errors when training converges. Figure 14 shows that the minimum distance is negatively related to the error, which matches our expectation, because we hope a large minimum distance means a better representation.


Figure 14: minimum distance vs error

A new network structure extracting nonlinear principal components

We have found that the autoencoder with a linear activation function (the linear autoencoder) is equivalent to principal component analysis using eigenvectors. However, linear autoencoders are considered “not interesting” because they can only use linear components to encode the original dataset, which works poorly both on some artificial nonlinear data and on some natural data such as handwritten digits. On the other hand, while nonlinear autoencoders with sigmoid or piecewise-linear activation functions can encode these datasets much better, we do not know much about what they are encoding.

Recently I came up with a new kind of network that seems to help solve the dilemma above. This kind of network can automatically find the nonlinear components when it tries to encode and decode data. More importantly, it extracts the nonlinear components in a way that we humans can expect and understand. Figure 15 shows two of the experimental results on some artificial data.



Figure 15: Experimental results of twisted cylinder (left) and first octant sphere (right)

In Figure 15, blue dots are the original data and red dots are the data after encoding and decoding. The yellow and green curves are the nonlinear principal components extracted automatically by the network. These nonlinear principal components match our expectation.

The structure of the network is shown in Figure 16:

Figure 16: Network structure

Putting an orthogonality constraint on the weights of the neural network

orth rate is a parameter that I introduced when I implemented the network. The larger orth rate is, the more we penalize the network when the weights are not orthogonal.
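The exact penalty is not written out above; one natural form (an assumption) is orth rate times the squared off-diagonal entries of the encoder Gram matrix, added to the usual reconstruction cost, as in the sketch below.

import numpy as np

def orthogonality_cost(W_enc, orth_rate):
    """Penalise non-orthogonal encoder weight vectors (one column of W_enc per hidden gate)."""
    G = W_enc.T @ W_enc
    off_diagonal = G - np.diag(np.diag(G))
    return orth_rate * np.sum(off_diagonal ** 2)

# total_cost = reconstruction_cost + orthogonality_cost(W_enc, orth_rate)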

I found the following:

• The orthogonal network converges much more slowly than non-orthogonal networks; the larger orth rate is, the slower it converges.

• But when it converges, it performs as well as non-orthogonal networks, no matter how large orth rate is.

• The cost of being non-orthogonal is almost always zero when training converges (the total cost is more than 10^6 times larger than the orthogonality cost), which means the network with the orthogonality constraint has almost the same ability to fit functions as the network without the constraint. If the orthogonal network lacked the ability to fit functions, it would let the orthogonality cost grow in order to decrease the total cost. In the linear case, every set of converged weights has an equivalent orthogonal set of weights; I believe something similar happens in the nonlinear case.

• The orthogonal network encodes things the way we humans think it should. In other words, it extracts the parameters that we think are most appropriate. Figure 17 compares the orthogonal network and the non-orthogonal network when they try to encode a cylinder; the coloured lines are the implicit grid imposed by the hidden gates.


Figure 17: Experimental results of the normal network (left) and the orthogonally constrained network (right)

I think Figure 18 shows the relationship between the abilities of the different networks we have been focusing on recently. By ability I mean the ability to fit functions: if there is an arrow from A to B, it means A has a stronger ability than B.

References

[1] https://en.wikipedia.org/wiki/Artificial_neural_network

[2] Baldi, P. & Hornik, K. Learning in Linear Neural Networks: a Survey

[3] Baldi, P. & Hornik, K. Neural networks and principal component analysis: learning from examples without local minima.
