shusdudphwhuvlq...

7
Evolutionary Optimization of Hyperparameters in Deep Learning Models Jin-Young Kim Dept. of Computer Science Yonsei University Seoul, South Korea [email protected] Sung-Bae Cho Dept. of Computer Science Yonsei University Seoul, South Korea [email protected] Abstract—Recently, deep learning is one of the most popular techniques in artificial intelligence. However, to construct a deep learning model, various components must be set up, including activation functions, optimization methods, a configuration of model structure called hyperparameters. As they affect the performance of deep learning, researchers are working hard to find optimal hyperparameters when solving problems with deep learning. Activation function and optimization technique play a crucial role in the forward and backward processes of model learning, but they are set up in a heuristic way. The previous studies have been conducted to optimize either activation function or optimization technique, while the relationship between them is neglected to search them at the same time. In this paper, we propose a novel method based on genetic programming to simultaneously find the optimal activation functions and optimization techniques. In genetic programming, each individual is composed of two chromosomes, one for the activation function and the other for the optimization technique. To calculate the fitness of one individual, we construct a neural network with the activation function and optimization technique that the individual represents. The deep learning model found through our method has 82.59% and 53.04% of accuracies for the CIFAR-10 and CIFAR-100 datasets, which outperforms the conventional methods. Moreover, we analyze the activation function found and confirm the usefulness of the proposed method. Keywords—Genetic programming, deep learning, neural networks, activation function, optimization technique. I. INTRODUCTION Deep learning has successfully accomplished the toughest tasks that require large amounts of data [1-3]. However, to build a deep learning model, it is necessary to define many hyperparameters, such as loss function, proper model layer configuration, activation function type, optimization technique, etc. Moreover, it has a huge search space to represent loss functions, activation functions, optimization techniques, and so on, which must be represented by a formula. For example, let us express the equations as ( ), ( )൯ , where is a binary function, is an unary function, and is operand. If we use arithmetic operators as a binary function, identity, negation, logarithm, exponential, trigonometric function, square root, and square as a unary function, and constants zero, one and variable as an operand, the search space becomes 4×7 ×3 in total, reaching about 2,000. Even searching for two formula is tricky to handle with the large search space which has a size of 3,000 = 4,000,000 . Therefore, it is a laborious task to construct a model for each task. These hyperparameters do not act independently of each other but interact and affect the performance of the model. Among the hyperparameters, the activation function and the optimization technique are related to the forward process connecting the model input to the output and the backward process propagating from the output to the model weights, as shown in Fig. 1. They have been studied to complement each other, e.g., vanishing gradient problem [4][5]. It has been investigated to find the optimal activation function and optimization technique. Many researchers have proposed a good formula for them through the theoretical approach [5-9]. It is not surprising, however, to use it to automatically find the formula for them as it solves many problems with deep learning [10-14]. The early “neuro- discovery” research included neuro-evolution [15-20]. Various approaches have been proposed from random search to grid search, search using evolutionary operation, and search using deep learning. Recent studies of this automation focus on only one of a large number of components of deep learning. This is a limitation in that the components of deep learning are closely related to each other when the model is trained. To overcome this limitation, we propose a method of finding both activation function and optimization technique at the same time by focusing on the relation between forward and backward processes in the activation function and optimization technique in the deep learning. Since the search space of deep learning model is very wide, the model is searched based on genetic programming to find the optimal point in a large search space. We encode the expression of the activation function and the 978-1-7281-2153-6/19/$31.00 c 2019 IEEE 831

Upload: others

Post on 12-Feb-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

Evolutionary Optimization of Hyperparameters in Deep Learning Models

Jin-Young Kim Dept. of Computer Science

Yonsei University Seoul, South Korea

[email protected]

Sung-Bae Cho Dept. of Computer Science

Yonsei University Seoul, South Korea [email protected]

Abstract—Recently, deep learning is one of the most popular techniques in artificial intelligence. However, to construct a deep learning model, various components must be set up, including activation functions, optimization methods, a configuration of model structure called hyperparameters. As they affect the performance of deep learning, researchers are working hard to find optimal hyperparameters when solving problems with deep learning. Activation function and optimization technique play a crucial role in the forward and backward processes of model learning, but they are set up in a heuristic way. The previous studies have been conducted to optimize either activation function or optimization technique, while the relationship between them is neglected to search them at the same time. In this paper, we propose a novel method based on genetic programming to simultaneously find the optimal activation functions and optimization techniques. In genetic programming, each individual is composed of two chromosomes, one for the activation function and the other for the optimization technique. To calculate the fitness of one individual, we construct a neural network with the activation function and optimization technique that the individual represents. The deep learning model found through our method has 82.59% and 53.04% of accuracies for the CIFAR-10 and CIFAR-100 datasets, which outperforms the conventional methods. Moreover, we analyze the activation function found and confirm the usefulness of the proposed method.

Keywords—Genetic programming, deep learning, neural networks, activation function, optimization technique.

I. INTRODUCTION

Deep learning has successfully accomplished the toughest tasks that require large amounts of data [1-3]. However, to build a deep learning model, it is necessary to define many hyperparameters, such as loss function, proper model layer configuration, activation function type, optimization technique, etc. Moreover, it has a huge search space to represent loss functions, activation functions, optimization techniques, and so on, which must be represented by a formula. For example, let us express the equations as 𝑓 𝑢 (𝑜𝑝 ), 𝑢 (𝑜𝑝 ) , where 𝑓 is a binary function, 𝑢 is an unary function, and 𝑜𝑝 is operand. If we

use arithmetic operators as a binary function, identity, negation, logarithm, exponential, trigonometric function, square root, and square as a unary function, and constants zero, one and variable 𝑥 as an operand, the search space becomes 4 × 7 × 3 in total, reaching about 2,000. Even searching for two formula is tricky to handle with the large search space which has a size of 3,000 = 4,000,000 . Therefore, it is a laborious task to construct a model for each task.

These hyperparameters do not act independently of each other but interact and affect the performance of the model. Among the hyperparameters, the activation function and the optimization technique are related to the forward process connecting the model input to the output and the backward process propagating from the output to the model weights, as shown in Fig. 1. They have been studied to complement each other, e.g., vanishing gradient problem [4][5].

It has been investigated to find the optimal activation function and optimization technique. Many researchers have proposed a good formula for them through the theoretical approach [5-9]. It is not surprising, however, to use it to automatically find the formula for them as it solves many problems with deep learning [10-14]. The early “neuro-discovery” research included neuro-evolution [15-20]. Various approaches have been proposed from random search to grid search, search using evolutionary operation, and search using deep learning. Recent studies of this automation focus on only one of a large number of components of deep learning. This is a limitation in that the components of deep learning are closely related to each other when the model is trained.

To overcome this limitation, we propose a method of finding both activation function and optimization technique at the same time by focusing on the relation between forward and backward processes in the activation function and optimization technique in the deep learning. Since the search space of deep learning model is very wide, the model is searched based on genetic programming to find the optimal point in a large search space. We encode the expression of the activation function and the

978-1-7281-2153-6/19/$31.00 c©2019 IEEE 831

Page 2: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

optimization technique into parse trees and place them into chromosomes of individual. Each individual 𝑥 differs from other individuals in a common neural network 𝑁𝑒𝑡(⋅,⋅) by the activation function and optimization technique of the chromosomes 𝑐 , and c . 𝑁𝑒𝑡 𝑐 , 𝑐 constructed with individual 𝑥 learns with CIFAR-10 dataset, and the accuracy is used as the fitness of 𝑥 . We use selection, mutation, and crossover as operators for the next generation based on individuals' fitness. A detailed description of each operation is given in Section III. The main contribution of this paper is as follows.

We propose a novel method of encoding two or more components of deep learning into one individual and searching them simultaneously.

The activation function and the optimization technique from the search results have a higher performance than the combination of the conventional ones.

Unlike previous research that constructed heuristic methods and tested their performance, we search first and analyze the features of the formula.

Fig. 1. Simple calculation process of deep learning. The value is forward passed by activation function and backward passed by optimization technique. 𝒍𝒊 represents the ith layer of deep learning model.

The remainder of this paper is organized as follows. Section

II introduces the previous research related to automatic search of the deep learning and addresses the limitations. To overcome these limitations, we propose a method in Section III and verify its performance in Section IV. Conclusions and future works are discussed in Section V.

II. RELATED WORK

Section II summarizes the previous studies to search the hyperparameters of the deep learning. There are two main types of research: heuristic and autonomous ones. We have summarized them in Table I.

Intensive research has been conducted to find the optimal activation function and optimization technique. From the stochastic gradient descent which became the basis of training the deep learning, there was a change in the method of obtaining the gradient from the Nesterov SGD [21][22]. Several optimization techniques such as Adagrad, RMSProp, Adadelta, and Adam were also proposed to better train the weight of the model [7-9][23]. These are the optimization techniques to add a momentum and an accelerator to the gradient to make learning more stable and effective. A rectified linear unit (ReLU), Leaky ReLU (LReLU), exponential linear unit (ELU), self-normalized linear unit (SeLU), etc. have been proposed to obtain a good performance of deep learning [5][6][24][25].

Agostinelli designed a learnable activation function to improve deep neural network, resulting in adaptive piecewise

linear function 𝜎(𝑥) = 𝑚𝑎𝑥(𝑥, 0) + 𝛴 𝑎 (−𝑥 + 𝑏 , 0) [26]. Zhang found an optimal activation function for each neuron by using a genetic algorithm [27]. Young et al. proposed a method to evolve the deep learning model using evolutionary operations on kernel size [28]. Orchard searched an optimization technique based on Hebbian update rule through genetic algorithm [29]. Similar to Ramachandran, Bello used reinforcement learning to design an optimizer, resulting in AddSign and PowerSign [30]. Alber proposed a method to find better backpropagation algorithm by using a genetic algorithm [31].

TABLE I. THE RELATED WORK OF SEARCHING OPTIMAL

HYPERPARAMETERS Category Author Description

Manual searching

Rumelhard (1986) Stochastic gradient descent

Nesterov (2013) Nesterov accelearated gradient

Duchi (2011) Addition of accelerator to update weights fairly

Tieleman (2012)

Linear transformation of accelerator to avoid exploding accelerator

Zeiler (2012) Improvement of performance by matching unit

Kingma (2014)

Combination of momentum and accelerator

Nair (2010) Rectified linear unit

Maas (2013) Leaky ReLU to prevent dying ReLU problem

Clevert (2015) Exponential linear unit

Klambauer (2017) Self-normalized linear unit

Automatic searching

Agostinelli (2015)

Design learnable activation function

Zhang (2015) Find optimal activation for each neuron

Orchard (2016)

Search optimization technique using Hebbian update rule

Bello (2017) Propose AddSign and PowerSign using reinforcement learning

Alber (2018) Discover new propagation rules with genetic algorithm

Although the activation function and the optimization technique affect each other in the deep learning, there is no research to optimize the two hyperparameters at the same time. Therefore, we propose a novel method of encoding two chromosomes for activation function and optimization technique into one individual and evolving them using genetic programming.

832

Page 3: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

III. THE PROPOSED METHOD

To automatically find the activation function and optimization technique of the deep learning, we have evolved the population 𝑃 of the models. The population of the jth-generation using evolutionary operators is called 𝑃 and the evolution process is introduced in the next subsection. We define the neural network with the activation function and the optimization technique represented by chromosome 𝑐 , 𝑐 of each individual 𝑥 in the population as the child network 𝑁𝑒𝑡 𝑐 , 𝑐 . We calculate the fitness by learning and verifying with the same dataset in a child network that represents each individual within a population in a generation. Finally, based on the calculated fitness, operators of evolutionary operations such as selection, mutation, and crossover are used to produce the population of the next generation.

We show the activation function and the optimization technique as (1) and (2) and try to evolve 𝑓, 𝑔 in each equation.

𝑥 = 𝑓(𝑥 ), (1)

𝜃 = 𝜃 − 𝑔 𝛻 𝐽(𝜃) , (2)

where 𝑥 is an input of the ith layer, 𝑓 is an activation function, 𝜃 is weights of deep learning model, 𝑔 is an optimization techinique, 𝐽(⋅) is a loss function, and 𝛻 is a gradient with respect to 𝜃.

A. Genetic Programming

a) Search Space: We use the domain-specific language (DSL) inspired by Bello et al. to express the formula of the activation function and optimization technique. The DSL expresses each equation as 𝑓 𝑢 (𝑜𝑝 ), 𝑢 (𝑜𝑝 ) , where 𝑓(⋅,⋅) is a binary function, 𝑢 (⋅) and 𝑢 (⋅) are unary functions, and 𝑜𝑝 and 𝑜𝑝 are possible operands. The sets of operands, unary functions, and binary functions are manually specified but the components of individual are selected automatically based on the genetic operators. Examples of each component are as follows: Operands: 0, 1, 2 (constants), 𝑥 (variable)

Unary functions 𝑢(𝑥) : 𝑥 (identity), −𝑥 (negation), 𝑥 (square), √𝑥 (square root), 𝑙𝑜𝑔(|𝑥| + 𝜖) (logarithm), 𝑒 (exponential), 𝑠𝑖𝑛 𝑥 , 𝑐𝑜𝑠 𝑥 (trigonometric function), 𝑡𝑎𝑛ℎ 𝑥 (hyperbolic function), 𝑚𝑎𝑥(𝑥, 0) , 𝑚𝑖𝑛(𝑥, 0) (comparison with zero), 𝜎(𝑥) (sigmoid function), 𝑐𝑙𝑖𝑝( , )(𝑥) (clip values in range [-1,1]), 𝑛𝑜𝑟𝑚(𝑥) (normalization),and 𝑥 + 𝜂 (addition of noise), where σ is sigmoid and 𝜂 ∼ 𝒩(0,0.1 )

Binary functions 𝑓(𝑥, 𝑦) : 𝑥 + 𝑦 (addition), 𝑥 − 𝑦 (subtraction), 𝑥 ∗ 𝑦 (multiplication), 𝑥/𝑦 (division), 𝛽 ∗𝑥 + (1 − 𝛽) ∗ 𝑦 (linear transformation), 𝑚𝑎𝑥(𝑥, 𝑦), and 𝑚𝑖𝑛(𝑥, 𝑦) (comparison), where 𝛽 = 0.1

The resulting quantity 𝑓 𝑢 (𝑜𝑝 ), 𝑢 (𝑜𝑝 ) can be used recursively as operands in subsequent parts of the equation. Besides, the outermost function may be a unary function due to a genetic operator. Using these components, we search for equations on search space with (7 × 15 × 4 ) ≈ 6.35 × 10 expressions without recursive expressions. In this paper, we use

a functional value as operands recursively so that we search for larger space in exponential order.

b) Encoding Method: We use a parse tree to convert the phenotype of a chromosome represented by DSL into a genotype. The internal nodes of the tree represent binary or unary functions, and all leaf nodes represent operands. The proposed method has two parse trees for each individual because it has to express two equations for the activation function and optimization technique in one individual. For example, if an individual has an activation function with the ReLU function shown in (3) and the optimization technique with an identity function as chromosomes, it can be depicted as shown in Fig. 2.

𝑓(𝑥) =𝑥 if 𝑥 ≥ 0 0 otherwise

(3)

Fig. 2. An example of each individual’s genotype that is represented as parse tree. In this example, the individual represents the phenotype with the ReLU function and the identity function as the activation function and optimization technique, respectively.

When we first create a population for evolution, we construct

a parse tree of the individual’s chromosomes with one binary function, two unary functions, and two operands so that the height is 3 as shown in Fig 2. Each node is set at random and at least one leaf node is set to a variable in one parse tree.

c) Genetic Operators: The genetic operators used in this paper consist of three types: selection, crossover, and mutation. Selection is the operator that selects the parent to construct the next generation. According to the usual fact that good children come from parents with good heredity, fitness proportionate selection is used as roulette wheel selection. As shown in Fig. 4, 𝑝 , which is the probability that 𝑥 is selected as the parent, increases according to the fitness value. We choose two parents in the population to create the next generation of children.

𝑝 = , (4)

where 𝑓 is a fitness value of individual 𝑥 and 𝑁 is the number of individuals in the population.

Crossover is an important operator of evolutionary computation. This is the process of mixing the chromosomes of two selected parents. Our proposed method has two chromosomes in an individual, so the operation used differs slightly from the conventional evolutionary operation. The chromosome for the activation function and the optimization technique should not influence each other in the crossover process because they interact only when calculating fitness and are formally independent. Thus, as shown in Fig. 3, we split each

833

Page 4: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

crossover process and mix the parse trees that represent the same property.

Fig. 3. An example of crossover process. Unlike conventional crossover operator, the chromosomes representing the same property are mixed.

Before the two parse trees are mixed, each mix point is

determined by selecting nodes from each tree at random. Next, crossover operations are performed by exchanging subtrees rooted at each mix point.

Mutation is an operation to increase the diversity within a population. Like the crossover, the mutation is applied to the individual’s chromosomes independently. Each node in the parse trees is selected at random to determine the mutation point. Next, we remove the subtree rooted at each mutation point and replace it by randomly creating a parse tree in the same way as the initial population creation. This process is shown in Fig. 4.

Fig. 4. An example of mutation process. Colored boxes represent mutation point. Parse trees which are out of individual A are randomly produced to mutate chromosome of A.

The overall process of the genetic operator when going to the

next generation is summarized as follows.

1. Select a pair of parent individuals from the current population, the probability of selection being proportional to fitness. It can occur that the same individual can be selected more than once to become a parent.

2. Crossover the subtrees at the randomly chosen node to form two offspring with probability 𝑝 . If no crossover takes place, selected parent form two offspring that are exact copies of their respective parents.

3. Mutate the two offspring at each node with probability 𝑝 , and place the resulting individuals in the new population.

4. Repeat 1-3 steps until 𝑁 offspring have been created.

5. Replace the current population with the new population.

B. Neural Network

To evaluate the performance of the activation function and the optimization technique represented in the chromosome of each individual, we modify the components of neural network 𝑁𝑒𝑡(⋅,⋅) and construct child neural network 𝑁𝑒𝑡 𝑐 , 𝑐 for individual 𝑥 .

The neural network used in this paper is wide residual networks (WRN) proposed by Zagoruyko and Komodakis [32]. This model is a method to adjust the width of the layer in the skip connection part of existing residual networks (ResNet) [33]. We implement almost exactly the same method introduced in WRN except for the learning decay because the model is trained in a short time. We use a WRN with 16 layers and a width multiplier of 2 (WRN 16-2) to calculate the fitness of each individual. We use He et al. 's method to initialize the weights of the model [34].

C. Training Process

This section describes the process of searching for optimal activation functions and optimization techniques across generations. We calculate the fitness of each individual 𝑥 with an accuracy of WRN including 𝑥 ’s activation function and optimal technique. For each generation, every child neural network is trained 𝑁 times for fitness evaluation, and every time the generation is changed, it goes through the genetic operator process introduced in the previous section. From the results of the last 𝑁 generation, only individuals with high fitness were extracted and the final verification results were calculated by newly learning 𝑁 times. This learning method is explained in Algorithm 1. Score(𝑁𝑒𝑡(⋅,⋅), 𝒟) is the result of verifying the neural network 𝑁𝑒𝑡(⋅,⋅) with data 𝒟 and NextGeneration(𝐺 ) is the function of generating the next generation by applying genetic operators to 𝐺 . The input of the algorithm is the training data 𝒟 , the test data 𝒟 , and the structure of the neural network, 𝑁𝑒𝑡(⋅,⋅) . The output of the algorithm is an individual with the optimal activation function and the optimization technique, along with the verification result.

Algorithm 1. Evolution process Input: 𝒟 , 𝒟 , 𝑁𝑒𝑡(⋅,⋅) Output: individual x, score for 𝑗 = 1, … , 𝑁 do if 𝑗 = 1 then Initialize 𝐺 = {𝑥 , … , 𝑥 }; end if if 𝑗 ≠ 1 then 𝐺 ← 𝑁𝑒𝑥𝑡𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑖𝑜𝑛(𝐺 ); end if for 𝑖 = 1, … , 𝑁 do train 𝑁𝑒𝑡 𝑐 , 𝑐 with 𝒟 ; 𝑓 ← Score 𝑁𝑒𝑡 𝑐 , 𝑐 , 𝒟 ; end for end for 𝑥 ← 𝑥 ;

testscore ← Score(𝑁𝑒𝑡(𝑐 , 𝑐 ), 𝒟 ); return 𝑥, testscore

834

Page 5: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

IV. EXPERIMENTS

A. Experimental Settings

To verify the performance of the proposed method, we used CIFAR-10 and CIFAR-100 datasets [35]. We set the size of population to 40, 𝑝 to 0.5, and 𝑝 to 0.05. Population evolved for a total of 100 generations (𝑁 = 100) and neural network was trained for 20 epochs to calculate the fitness (𝑁 = 20). Early stopping was used to save learning time for individuals that do not improve. We experimented to limit the maximum height to 8 to prevent the parse tree from getting too high and complicated. When creating a parse tree, we added ReLU, tanh, sigmoid, and ELU into initial population to create a pool at the beginning of evolution.

The operating system of the computer used in our experiments was Ubuntu 16.04.2 LTS and the CPU of the computer was an Intel Xeon E5-2630V3. The random access memory of the computer was Corsair DDR4 32GB × 2, and the graphic processing unit of the computer was GTX 1080ti 11 GB. To construct a model, we use python 3.6 and Keras library.

B. Results

In this paper, to verify that better performance is obtained by exploring the activation function and the optimization technique together, we have conducted three experiments: searching for the activation function, searching for the optimizer, and searching for the both simultaneously. Among the experimental results, those that generally performed well are as follows.

Activation function: ReLU form (𝑓(𝑥) = max(𝑥, 0), 𝑓(𝑥) =max(abs(max(𝑥, 0))) , sin form ( 𝑓(𝑥) = max(sin 𝑥 , 0) , 𝑓(𝑥) = max(𝑥, 0) + sin 𝑥 − sin sin 𝑥 , 𝑓(𝑥) = sin 𝑥 +max(𝑥, 0), etc.).

Optimization technique: identity ( f(x) = x ), scalar multiplication (𝑓(𝑥) = 2𝑥 , 𝑓(𝑥) = √2𝑥 , etc.), clip (𝑓(𝑥) =clip( , )(𝑥)), etc.

Both: 𝑓(𝑥) = sin 𝑥 + max(𝑥, 0) for activation function, 𝑓(𝑥) = tanh(𝑥) for optimization technique

We name sin 𝑥 + max(𝑥, 0) as SineLU in the activation functions found. The fitness (accuracy) changes of individuals for each generation are shown in Fig. 5. We plot the maximum, minimum, and average fitness values in each generation with a blue, orange, and green colored line graph. To see how the population changes while evolving, we label individuals with properties similar to the activation functions and optimization technique found in this paper in blue star; sin form for activation function and tanh for optimization technique. The similar function is defined as a formula that uses a specific form recursively. Individuals expressed in blue star begin to increase in number from about 40 generations and continue to have the best fitness.

We compare the results of the proposed method and the experimental results by fixing one and modifying the other in order to confirm that the method of finding the activation function and the optimization technique at the same time is better than the conventional method. The followings are used for comparison.

Hyperbolic tangent

𝑓(𝑥) = tanh 𝑥

Rectified Linear Unit (ReLU) [5]

𝑓(𝑥) =𝑥 if 𝑥 ≥ 0 0 otherwise

Leaky ReLU (LReLU) [22]

𝑓(𝑥) =𝑥 if 𝑥 ≥ 0 𝛼𝑥 otherwise

where α = 0.2. LReLU enables a small amount of information to flow when 𝑥 < 0, which can avoid dying ReLU problem that information disappears when 𝑥 is negative.

SineLU

𝑓(𝑥) =sin 𝛽𝑥 + 𝛽𝑥 if 𝑥 ≥ 0 sin 𝛽𝑥 otherwise

Fig. 5. Change of the best, worst, and average values of accuracy over generations. Blue stars express individuals with properties similar to the activation functions and optimization technique found in this paper.

835

Page 6: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

where 𝛽 = 0.5. We use 𝛽 values between 0.5 and 1.0. For the optimization technique, identity, clip, addition of noise, and scalar multiplication (beta) are used.

We conduct the experiment on the CIFAR-10 dataset 10 times and organize the experimental results with the box plot as shown in Fig. 6. It can be seen that SineLU has the best performance for any optimization technique. Besides, since the result of simultaneous search of the activation function and the optimization technique resulted in 82.59%, which is higher than before, the simultaneous search method proposed in this paper is most effective.

Fig. 6. Experimental results of different activation functions and optimization techniques. SineLU outperforms other activation functions. Besides, the result of simultaneously searching method has the best performance.

The results are shown in Tables II and III for quantitative evaluation of the experiment. The tables show the results of various experiments with CIFAR-10 and CIFAR-100. Among the various combinations, 82.23% is the best performance, while the performances of the activation function and the optimization technique found by simultaneous search are 82.59% and 53.04% with CIFAR-10 and CIFAR-100, respectively.

TABLE II. CLASSIFICATION RESULTS OF CIFAR-10 DATASET OVER

DIFFERENT ACTIVATION FUNCTIONS AND OPTIMIZATION TECHNIQUES

Tanh ReLU LReLU SineLU

Identity 68.39 80.17 79.76 82.17

Clip 68.66 81.25 80.29 82.35

Noise 58.58 66.53 63.12 68.45

Beta 68.89 80.08 80.55 82.23

TABLE III.

CLASSIFICATION RESULTS OF CIFAR-100 DATASET OVER DIFFERENT ACTIVATION FUNCTIONS AND OPTIMIZATION

TECHNIQUES Tanh ReLU LReLU SineLU

Identity 36.96 49.88 47.26 51.39

Clip 35.68 49.62 48.24 52.38

Noise 24.74 32.93 32.86 34.57

Beta 36.24 52.16 47.87 52.95

C. Analysis

We analyze SineLU found through evolutionary computation in this paper. As can be seen in Fig. 7, the SineLU function has a value between ReLU and tanh in the positive range and is dynamic in the negative range. SineLU is unbounded and bounded below like ReLU, but SineLU is slightly smoother and more dynamic than ReLU. The derivative function SineLU is

𝑓 (𝑥) =𝑐𝑜𝑠 𝑥 + 1 if 𝑥 ≥ 0 𝑐𝑜𝑠 𝑥 otherwise

(5)

The first derivative function of SineLU is shown in Fig. 8. It can be seen that SineLU has less vanishing gradient than tanh and does not cause dying ReLU problem which is a phenomenon in which the function value becomes zero with an 𝑥 value of zero or less. Another advantage is that the SineLU can be graphically shown to have a better effect when using accelerators in the optimization technique. The accelerator is the value of how much the weight has changed:

𝐺 = 𝐺 + 𝛻 𝐽(𝜃) , (6)

where 𝐺 is the accelerator after seven times iterations.

As shown in Fig. 9, the squared derivative function of ReLU has only 0 or 1, whereas SineLU has a value between 0 and 1.

Fig. 7. Various activation functions including SineLU.

Fig. 8. First derivative function of various activation functions.

836

Page 7: SHUSDUDPHWHUVLQ 'HHS/HDUQLQJ0RGHOVsclab.yonsei.ac.kr/publications/Papers/IC/2019_CEC_JYK.pdf · (yroxwlrqdu\2swlpl]dwlrqri+\shusdudphwhuvlq 'hhs/hduqlqj0rghov-lq

Fig. 9. Squared derivative function of various activation functions.

V. CONCLUSIONS

In this paper, we presented the relationship between the activation function and the optimizer and a method to automatically find the equations for both. We proposed a method to encode two chromosomes in one individual to evolve several properties at the same time. As a result, we found that searched one has the best performance compared to the combination of the existing activation function and the optimization technique. Moreover, we analyzed the SineLU and proved that it is better than the conventional activation function from the view point of analysis.

We plan to evolve many properties simultaneously by encoding a more diverse chromosome (e.g., structure of neural network or hyperparameters) to an individual. We will construct larger search space for the novelty search and use more complex form of optimization technique formula. We will also conduct an experiment with various models and datasets for performance verification.

ACKNOWLEDGMENT This work was supported by the ICT R&D program of MSIP/IITP. [2017-0-00306, Development of Multimodal Sensor-based Intelligent Systems for Outdoor Surveillance Robots]

REFERENCES [1] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,

no. 7553, pp. 436-444, 2015. [2] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural

Networks, vol. 61, pp. 85-117, 2015. [3] I. Goodfellow, J. Pouget-Abadie, M. Mirze, B. Xu, D. Warde-Farley, S.

Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[4] S. Hochreiter, “The vanishing gradient problem during leraning recurrent neural nets and problem solutions,” Int. Journal of Unvertainty, Fuzziness and Knowledge Based Systems, vol. 6, no. 2, pp. 107-116, 1998.

[5] V. Nair and G. E. Hinton, “Recitified linear units improve restricted boltzmann machines,” Int. Conf. on Machine Learning, pp. 807-814, 2010.

[6] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” In Advances in Neural Information Processing Systems, pp. 971-980, 2017.

[7] J. Duchi, E. hazen, and Y. Singer, “Adaptive subgradient methods for inline learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, pp. 2121-2159, 2011.

[8] M. D. Zeiler, “Adadelta: An adaptive learning rate method,” arXiv preprint arXiv: 1212.5701, 2012.

[9] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv: 1412.6980, 2014.

[10] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Leraning Research, vol. 13, no. 2, pp. 281-305, 2012.

[11] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” In Advances in Neural Information Processing Systems, pp. 2951-2959, 2012.

[12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural networks,” In Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.

[13] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv: 1611.02167, 2016.

[14] B. Zoph and Quoc V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv: 1611.01578, 2016.

[15] K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic Programming and Evolvable Machines, vol. 8, no. 2, pp. 131-162, 2007.

[16] T. Breuel and F. Shafait, “Automlp: Simple, effective, fully automated learning rate and size adjustment,” In the Learning Workshop, 2010.

[17] W. Zaremba, “An emprircal exploration of recurrent network architectures,” Int. Conf. on Machine Learning, pp. 2342-2350, 2015.

[18] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra, “Convolution by evolution: Differentiable pattern producing networks,” Genetic and Evolutionary Computation Conference, pp. 009-116, 2016.

[19] S.-B. Cho and K. Shimohara, “Evolutionary learning of modular neural networks with genetic programming,” Applied Intelligence, vol. 9, no. 3, pp. 191-200, 1998.

[20] Y. G. Seo, S.-B. Cho, and X. Yao, “The impact of payoff function and local interaction on the N-player interated prisoner’s dilemma,” Knowledge and Information Systems, vol. 2, no. 4, pp. 461-478, 2000.

[21] D. E. Rumelhard, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 553-536, 1986.

[22] Y. Nesterov, “Introductory lectures on convex optimization: A basic course,” vol. 87, Springer Science & Business Media, 2013.

[23] T. Tieleman and G. E. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” Coursera: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.

[24] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” Int. Conf. on Machine Learning, pp. 3-8, 2013.

[25] D. A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv: 1511.07289, 2015.

[26] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.

[27] L. M. Zhang, “Genetic deep neural networks using different activation functions for financial data mining,” Int. Conf. on Big Data IEEE, pp. 2849-2851, 2015.

[28] S. R. Young, D. C. Rose, T. P. Karnowski, S. H. Lim, and R. M. Ratton, “Optimizing deep learning hyper-parameters through an evolutionary algorithm,” Workshops on Machine Learning in High-Performance Computing Envrionments, ACM, pp. 4-8, 2015.

[29] J. Orchard and L. Wang, “The evolution of a generalized neural learning rule,” Int. Joint Conf. on Neural Networks IEEE, pp. 4688-4694, 2016.

[30] I. Bello, B. Zoph, V. Vasudevan, and Quoc V. Le, “Neural optimizer search with reinforcement learning,” Int. Conf. on Machine Learning, pp. 459-468, 2017.

[31] M. Alber, I. Bello, B. Zoph, P.-J. Kindermans, P. Ramachandran, and Q. Le, “Backprop evolution,” arXiv preprint arXiv: 1808.02822, 2018.

[32] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv peprint arXiv: 1605.07146, 2016.

[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Computer Vision and Pattern Recognition, pp. 770-778, 2016.

[34] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, Int. Conf. on Computer Vision, pp. 1026-1034 ,2015.

[35] A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” Technical Report, University of Toronto, 2009.

837