arXiv:2007.14166v1 [cs.LG] 28 Jul 2020
International Journal of Pattern Recognition and Artificial Intelligence © World Scientific Publishing Company
A Comparison of Optimization Algorithms for Deep Learning
Derya Soydaner
Statistics Department, Mimar Sinan Fine Arts University, İstanbul, 34380, Turkey
derya.soydaner@msgsu.edu.tr
In recent years, we have witnessed the rise of deep learning. Deep neural networks have proved their success in many areas. However, the optimization of these networks has become more difficult as neural networks grow deeper and datasets become bigger. Therefore, more advanced optimization algorithms have been proposed over the past years. In this study, widely used optimization algorithms for deep learning are examined in detail. To this end, these algorithms, called adaptive gradient methods, are implemented for both supervised and unsupervised tasks. The behaviour of the algorithms during training and their results on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers and Labeled Faces in the Wild, are compared by pointing out their differences against basic optimization algorithms.
Keywords: Adaptive gradient methods; optimization; deep learning; image processing.
1. Introduction

Adaptive gradient methods have been widely used in deep learning. Although stochastic gradient descent (SGD) has been one of the most preferred algorithms for many years, it struggles with serious problems such as ill-conditioning and the time required for large-scale datasets when training deep neural networks. It requires manual tuning of the learning rate and is difficult to parallelize [16]. Thus, the problems of SGD led to the invention of more advanced algorithms. Nowadays, the optimization algorithms used for deep learning adapt their learning rates during training. Basically, the adaptive gradient methods adjust the learning rate for each parameter: when the gradients for some parameters are large, their learning rates are reduced, and vice versa.
In recent years, many adaptive methods have been proposed and have become the most commonly used alternatives to SGD. In addition to their high performance when training deep models, another advantage is that they are first-order optimization algorithms, just as SGD is. Thus, they are computationally efficient for training deep neural networks. This work aims to present the most widely used adaptive optimization algorithms that have proven their superiority and to compare their working principles. To this end, image processing, one of the most important application areas of deep learning, is considered. Firstly, the effects of adaptive gradient methods are observed for the image classification task by using convolutional neural networks (CNNs). Secondly,
as an unsupervised task, convolutional autoencoders (CAEs), one of the quintessential examples of unsupervised learning, are used for image reconstruction. Besides, the effects of the algorithms are examined by using denoising autoencoders. In this way, the behaviours of the algorithms during training are analyzed in addition to their performance on both supervised and unsupervised learning tasks.
The rest of the paper is organized as follows. In Section 2, studies about adaptive gradient methods and their most recent variants are reviewed briefly. In Section 3, widely used optimization algorithms in deep learning are explained by showing their update rules and their solutions to the challenges of training deep networks. SGD and its momentum variants are also mentioned in this section. The experiments and comparative results are presented in Section 4, and the conclusion in Section 5.
2. Related Work
In the deep learning literature, the working principles and performance of optimization algorithms are widely studied. For example, theoretical guarantees of convergence to criticality for RMSProp and Adam are presented in the setting of optimizing a non-convex objective [28]. The authors design experiments to empirically study the convergence and generalization properties of RMSProp and Adam against Nesterov's accelerated gradient method. In another study, conjugate gradient, SGD and limited memory BFGS algorithms are compared [16]. A review of numerical optimization algorithms in the context of machine learning applications is presented in [3]. Additionally, similar to this work, an overview of gradient-based optimization algorithms is given in [25].
In this study, the most widely used optimization algorithms are examined in the context of deep learning. Meanwhile, new variants of adaptive methods continue to be proposed. For example, new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, are proposed in [18]. They employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD. Also, a new algorithm that adapts the learning rate locally for each parameter separately, and also globally for all parameters together, is presented in [9]. Another new algorithm, called Nostalgic Adam (NosAdam), which places bigger weights on the past gradients than on the recent gradients when designing the adaptive learning rate, is introduced in [12]. In another study, two variants called SC-Adagrad and SC-RMSProp are proposed [20]. A new adaptive optimization algorithm called YOGI, which controls the increase in the effective learning rate, is presented in [30]. A novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner is developed in [4]. Also, a new algorithm called Adafactor is presented in [26]. Instead of scaling parameter updates by the inverse square roots of exponential moving averages of squared past gradients, Adafactor maintains only the per-row and per-column sums of the moving averages, and estimates the per-parameter second moments based on these sums.
3. Optimization Algorithms with Adaptive Learning Rates
The choice of the algorithm used to optimize a neural network is one of the most important steps. In machine learning, there are three main kinds of optimization methods. The first, called batch or deterministic gradient methods, processes all training examples simultaneously in one large batch. The second, called stochastic or online methods, uses only one example at a time. Today, most algorithms are a blend of the two: during training, they use only a part of the training set at each update. These algorithms are called minibatch methods. In the deep learning era, minibatch methods are mostly preferred for two major reasons. Firstly, they accelerate the training of neural networks. Secondly, as the minibatches are selected randomly and are independent, an unbiased estimate of the expected gradient can be computed [8].
In this paper, the most widely used minibatch-based adaptive algorithms are examined in detail. Besides, SGD, which was conventionally preferred for a long time, is briefly explained alongside its momentum variants.
3.1. Stochastic gradient descent
Basically, SGD [24] follows the gradient of randomly selected minibatches downhill. In order to train a neural network using SGD, firstly, the gradient estimate is computed by using a loss function. Then, the update at iteration k is applied to the parameters θ. The calculations for each minibatch of m examples from the training set x^(1), ..., x^(m) with corresponding targets y^(i) are as follows:
\hat{g} \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})    (1)

\theta \leftarrow \theta - \epsilon_k \hat{g}    (2)
Here, the learning rate ε_k is a very important hyperparameter. The magnitude of the update depends on the learning rate. If it is too large, updates depend too much on recent instances. If it is too small, many updates may be needed for convergence [1]. This hyperparameter can be chosen by trial and error. One way is to choose the one of several learning rates that results in the smallest loss function value; this is called line search. Another way is to monitor the first several epochs and use a learning rate that is higher than the best-performing learning rate. In Equation 2, the learning rate is denoted as ε_k at iteration k because, in practice, it is necessary to gradually decrease the learning rate over time [8].
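For illustration, the update in Equations 1 and 2 can be written in a few lines of NumPy. This is only a minimal sketch with a toy least-squares loss, a fixed learning rate and invented function names; it is not the implementation used in the paper.

```python
import numpy as np

def sgd_step(theta, grad_fn, x_batch, y_batch, lr=0.01):
    """One SGD update (Eqs. 1-2): theta <- theta - lr * minibatch gradient."""
    g_hat = grad_fn(theta, x_batch, y_batch)   # gradient averaged over the minibatch
    return theta - lr * g_hat

def lsq_grad(theta, x, y):
    """Gradient of the toy loss L = ||x @ theta - y||^2 / (2m)."""
    return x.T @ (x @ theta - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 5))
theta_true = np.arange(5.0)
y = x @ theta_true
theta = np.zeros(5)
for _ in range(1000):
    theta = sgd_step(theta, lsq_grad, x, y, lr=0.1)
# theta converges toward theta_true on this toy problem
```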
3.1.1. Stochastic gradient descent with momentum
SGD has difficulty reaching the global optimum because of its tendency to oscillate, especially on steep surface curves. Noisy or small gradients may also be problematic. The method of momentum [22] is designed to accelerate learning in such cases. It aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. The idea behind this algorithm is to take a running average by incorporating the previous update in the current change, as if there is a momentum due to previous updates [1]. When SGD is used with momentum, it can converge faster as well as with reduced oscillation.
SGD with momentum uses a variable v called velocity. The velocity is the direction and speed at which the parameters move through parameter space. It is set to an exponentially decaying average of the negative gradient. Also, SGD with momentum requires a new hyperparameter α ∈ [0, 1), called the momentum parameter, that determines how quickly the contributions of previous gradients exponentially decay. The parameters are updated after the velocity update is computed:
v \leftarrow \alpha v - \epsilon \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})    (3)

\theta \leftarrow \theta + v    (4)
The velocity v accumulates the gradient elements. The larger α is relative to ε, the more previous gradients affect the current direction. Common values of α used in practice are 0.5, 0.9 and 0.99 [8]. However, a disadvantage of this algorithm is the requirement of the momentum hyperparameter in addition to the learning rate.
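A minimal NumPy sketch of Equations 3 and 4 is given below, assuming the averaged minibatch gradient is passed in as `grad` and the velocity is initialized to zeros; the names and default values are illustrative only.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """SGD with momentum (Eqs. 3-4): v <- alpha*v - lr*grad, theta <- theta + v."""
    v = alpha * v - lr * grad    # Eq. (3): decay the old velocity, add the new gradient step
    theta = theta + v            # Eq. (4)
    return theta, v

# usage: start with zero velocity and feed the averaged minibatch gradient each step
theta, v = np.zeros(5), np.zeros(5)
grad = np.ones(5)                # placeholder gradient
theta, v = momentum_step(theta, v, grad)
```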
3.1.2. Stochastic gradient descent with Nesterov momentum
SGD with Nesterov momentum [29] is proposed as a variant of the standard momentum method, taking inspiration from Nesterov's accelerated gradient method [21]. The idea is to measure the gradient of the loss function not at the local position but slightly ahead in the direction of the momentum [7]. In other words, the gradient is evaluated after the current velocity is applied. Therefore, SGD with Nesterov momentum begins with an interim update for a minibatch [8]:
\tilde{\theta} \leftarrow \theta + \alpha v    (5)
Then, the gradient is computed at the interim point. Using this gradient, the velocity update is computed. Finally, the parameters are updated:
g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_{i=1}^{m} L(f(x^{(i)}; \tilde{\theta}), y^{(i)})    (6)

v \leftarrow \alpha v - \epsilon g    (7)

\theta \leftarrow \theta + v    (8)
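The look-ahead evaluation of Equations 5-8 can be sketched in NumPy as follows; `grad_fn` is a placeholder that returns the averaged minibatch gradient at a given parameter value, and all names and defaults are illustrative rather than taken from the paper.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """SGD with Nesterov momentum (Eqs. 5-8): the gradient is evaluated at the
    interim (look-ahead) point theta + alpha*v rather than at theta itself."""
    theta_interim = theta + alpha * v    # Eq. (5)
    g = grad_fn(theta_interim)           # Eq. (6): minibatch gradient at the interim point
    v = alpha * v - lr * g               # Eq. (7)
    theta = theta + v                    # Eq. (8)
    return theta, v

# usage with a toy quadratic loss 0.5*||theta||^2, whose gradient is theta itself
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
```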
3.2. AdaGrad
One of the optimization algorithms that individually adapt the learning rates of model parameters is AdaGrad [6]. Parameters with the largest partial derivatives of the loss have a rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate [8]. This is achieved by using all the historical squared values of the gradient.
AdaGrad uses an additional variable r for gradient accumulation. At the beginning of the algorithm, the gradient accumulation variable is initialized to zero, and the gradient is computed for a minibatch:
g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})    (9)
Using this gradient, the squared gradient is accumulated. Then, the update is computed by scaling the learning rates of all parameters inversely proportionally to the square root of the sum of all the historical squared values of the gradient. Finally, this update is applied to the model parameters:
r \leftarrow r + g \odot g    (10)

\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g    (11)

\theta \leftarrow \theta + \Delta\theta    (12)
where ε is the global learning rate and δ is a small constant for numerical stability. However, AdaGrad has serious disadvantages. Generally, it performs well for simple quadratic problems, but it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum [7]. Also, for training deep neural networks, the accumulation of squared gradients from the beginning of training can result in an excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models [8].
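The per-parameter scaling of Equations 9-12 can be sketched in NumPy as below; names and default values are illustrative, not from the paper. The ever-growing accumulator r is what causes the early-stopping behaviour described above.

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.001, delta=1e-7):
    """AdaGrad (Eqs. 9-12): accumulate squared gradients in r and scale each
    parameter's step by 1 / (delta + sqrt(r)) element-wise."""
    r = r + grad * grad                          # Eq. (10)
    update = -lr * grad / (delta + np.sqrt(r))   # Eq. (11)
    return theta + update, r                     # Eq. (12)

# usage: r starts at zero and only grows, which eventually stalls learning
theta, r = np.ones(3), np.zeros(3)
theta, r = adagrad_step(theta, r, grad=2.0 * theta)
```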
3.3. AdaDelta
The underlying idea of the AdaDelta algorithm is to remedy the two main drawbacks of AdaGrad: the continual decay of learning rates throughout training and the need for a manually selected global learning rate. To this end, AdaDelta restricts the window of past gradients to some fixed size w instead of accumulating the sum of squared gradients over all time. As mentioned in the previous section, AdaGrad accumulates the squared gradients from each iteration starting at the beginning of training. This accumulated sum continues to grow during training, effectively shrinking the learning rate in each dimension. After many iterations, the learning rate becomes infinitesimally small. With the windowed accumulation, AdaGrad becomes a local estimate using recent gradients instead of accumulating to infinity. Thus, learning continues to make progress even after many iterations of updates have been done [31].
Since storing w previous squared gradients is inefficient, AdaDelta implements this accumulation as an exponentially decaying average of the squared gradients. Assuming this running average is E[g^2]_t at time t, the gradient accumulation is computed as follows:

E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2    (13)

where ρ is a decay constant similar to that used in the momentum method. Since the square root of this quantity is required in the parameter updates, it effectively becomes the root mean square (RMS) of the previous squared gradients up to time t:

RMS[g]_t = \sqrt{E[g^2]_t + \delta}    (14)

where δ is again a small constant. Based on this RMS, the parameter update is computed, the updates are accumulated and the parameters are updated, respectively:

\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t    (15)

E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho) \Delta\theta_t^2    (16)

\theta_{t+1} = \theta_t + \Delta\theta_t    (17)
The advantage of AdaDelta is that it requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architectures, various data modalities and the selection of hyperparameters [31].
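Equations 13-17 can be summarized in a short NumPy sketch; the names and defaults below are illustrative and not taken from the paper. Note that no global learning rate appears anywhere in the update.

```python
import numpy as np

def adadelta_step(theta, eg2, ed2, grad, rho=0.95, delta=1e-6):
    """AdaDelta (Eqs. 13-17). eg2 and ed2 hold the running averages E[g^2] and
    E[dtheta^2]; the step size is the ratio of their RMS values."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                          # Eq. (13)
    update = -(np.sqrt(ed2 + delta) / np.sqrt(eg2 + delta)) * grad   # Eqs. (14)-(15)
    ed2 = rho * ed2 + (1 - rho) * update ** 2                        # Eq. (16)
    return theta + update, eg2, ed2                                  # Eq. (17)

# usage: both accumulators start at zero
theta, eg2, ed2 = np.ones(3), np.zeros(3), np.zeros(3)
theta, eg2, ed2 = adadelta_step(theta, eg2, ed2, grad=2.0 * theta)
```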
3.4. RMSProp
Another algorithm that modifies AdaGrad is RMSProp [10]. It is proposed to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. As mentioned in Section 3.2, AdaGrad shrinks the learning rate according to the entire history of the squared gradient. Instead, RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl [8].
In order to implement RMSProp, the squared gradient is accumulated after the gradient is computed:

r \leftarrow \rho r + (1 - \rho) g \odot g    (18)

where ρ is the decay rate. Then the parameter update is computed and applied as follows:

\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g    (19)

\theta \leftarrow \theta + \Delta\theta    (20)
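A minimal NumPy sketch of Equations 18-20 follows, again with illustrative names and defaults; only the accumulation rule differs from the AdaGrad sketch above.

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp (Eqs. 18-20): exponentially decaying average of squared gradients
    replaces AdaGrad's ever-growing sum."""
    r = rho * r + (1 - rho) * grad * grad        # Eq. (18)
    update = -lr * grad / np.sqrt(delta + r)     # Eq. (19)
    return theta + update, r                     # Eq. (20)

# usage
theta, r = np.ones(3), np.zeros(3)
theta, r = rmsprop_step(theta, r, grad=2.0 * theta)
```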
3.5. Adam
Adam is one of the most widely used optimization algorithms in deep learning. The name Adam is derived from adaptive moment estimation, because it computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. Adam combines the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings [13].
Adam has some important properties. Firstly, momentum is incorporated directly as an estimate of the first-order moment of the gradient. Also, Adam includes bias corrections to the estimates of both the first-order moments and the second-order moments to account for their initialization at the origin [8]. The algorithm updates exponential moving averages of the gradient m_t and the squared gradient u_t, where the hyperparameters ρ_1 and ρ_2 control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the first moment (the mean) and the second raw moment (the uncentered variance) of the gradient [13].
The Adam algorithm requires first and second moment variables m and u. After the gradient is computed, the biased first and second moment estimates are updated at time step t, respectively:

m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1) g_t    (21)

u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2) g_t \odot g_t    (22)

Then, the bias in the first and second moments is corrected. Using the corrected moment estimates, the parameter update is calculated and applied:

\hat{m}_t \leftarrow \frac{m_t}{1 - \rho_1^t}    (23)

\hat{u}_t \leftarrow \frac{u_t}{1 - \rho_2^t}    (24)

\Delta\theta = -\epsilon \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta}    (25)

\theta_t \leftarrow \theta_{t-1} + \Delta\theta    (26)
Adam has many advantages. First of all, it requires little tuning of the learning rate. Also, it is a method that is straightforward to implement and invariant to diagonal rescaling of the gradients. It is computationally efficient and has small memory requirements. Besides, Adam is appropriate for non-stationary objectives and for problems with very noisy and sparse gradients [13].
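As an illustration, Equations 21-26 can be written as a short NumPy routine; this is only a sketch with illustrative names and the common default values, not the implementation used in the experiments.

```python
import numpy as np

def adam_step(theta, m, u, grad, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam (Eqs. 21-26) with bias-corrected moment estimates; t is the 1-based step."""
    m = rho1 * m + (1 - rho1) * grad            # Eq. (21): first moment (mean)
    u = rho2 * u + (1 - rho2) * grad * grad     # Eq. (22): second raw moment
    m_hat = m / (1 - rho1 ** t)                 # Eq. (23): bias correction
    u_hat = u / (1 - rho2 ** t)                 # Eq. (24): bias correction
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)   # Eqs. (25)-(26)
    return theta, m, u

# usage
theta, m, u = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, u = adam_step(theta, m, u, grad=2.0 * theta, t=t)
```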
3.6. AdaMax
AdaMax is proposed as an extension of Adam. It is a variant of Adam based on the infinity norm. In Adam, the update rule for individual weights is to scale their gradients inversely proportionally to an L2 norm of their individual current and past gradients. AdaMax is based on the idea that this L2-norm-based update rule can be generalized to an Lp-norm-based update rule.
The AdaMax algorithm begins by calculating the gradient of the stochastic objective at time step t, as usual. Then, the biased first moment estimate and the exponentially weighted infinity norm are computed. Using them, the model parameters are updated. These steps are defined below, respectively:
g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})    (27)

m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1) g_t    (28)

\gamma_t \leftarrow \max(\rho_2 \gamma_{t-1}, |g_t|)    (29)

\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{1 - \rho_1^t} \frac{m_t}{\gamma_t}    (30)
It is shown that if AdaMax is preferred as the optimization algorithm, there is no need to correct for initialization bias. Besides, the magnitude of parameter updates has a simpler bound with AdaMax than with Adam [13].
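The following NumPy sketch mirrors Equations 27-30 with illustrative names and defaults; a tiny constant is added in the denominator purely as a safeguard and is not part of the stated update.

```python
import numpy as np

def adamax_step(theta, m, gamma, grad, t, lr=0.001, rho1=0.9, rho2=0.999):
    """AdaMax (Eqs. 27-30): Adam's L2-based accumulator is replaced by an
    exponentially weighted infinity norm, so gamma needs no bias correction."""
    m = rho1 * m + (1 - rho1) * grad                   # Eq. (28)
    gamma = np.maximum(rho2 * gamma, np.abs(grad))     # Eq. (29)
    # the small constant below only avoids 0/0 for components whose gradient has
    # always been exactly zero; it is not part of Eq. (30)
    theta = theta - (lr / (1 - rho1 ** t)) * m / (gamma + 1e-12)
    return theta, m, gamma

# usage
theta, m, gamma = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, gamma = adamax_step(theta, m, gamma, grad=2.0 * theta, t=t)
```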
3.7. Nadam
Nadam (Nesterov-accelerated adaptive moment estimation) modifies Adam's momentum component with Nesterov's accelerated gradient. Thus, Nadam aims to improve the speed of convergence and the quality of the learned models [5].
Similar to Adam, after the gradient is computed, the first and second moment variables are updated as in Equations 31 and 32. Then, the corrected moments are computed and the parameters are updated as in the following equations:
m_t \leftarrow \rho_t m_{t-1} + (1 - \rho_t) g_t    (31)

u_t \leftarrow \nu u_{t-1} + (1 - \nu) g_t^2    (32)

\hat{m}_t \leftarrow \rho_{t+1} m_t / \left(1 - \prod_{i=1}^{t+1} \rho_i\right) + (1 - \rho_t) g_t / \left(1 - \prod_{i=1}^{t} \rho_i\right)    (33)

\hat{u}_t \leftarrow \nu u_t / (1 - \nu^t)    (34)

\theta_t \leftarrow \theta_{t-1} - \epsilon \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta}    (35)
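Equations 31-35 allow a time-varying momentum schedule ρ_t. The sketch below assumes a constant schedule ρ_t = ρ, so the products over ρ_i reduce to powers of ρ; names and defaults are illustrative, not from the paper.

```python
import numpy as np

def nadam_step(theta, m, u, grad, t, lr=0.001, rho=0.9, nu=0.999, delta=1e-8):
    """Nadam sketch (Eqs. 31-35), assuming a constant momentum schedule rho_t = rho,
    so the products over rho_i in Eq. (33) become rho**t and rho**(t+1)."""
    m = rho * m + (1 - rho) * grad                          # Eq. (31)
    u = nu * u + (1 - nu) * grad * grad                     # Eq. (32)
    m_hat = (rho * m / (1 - rho ** (t + 1))                 # look-ahead momentum term
             + (1 - rho) * grad / (1 - rho ** t))           # Eq. (33)
    u_hat = nu * u / (1 - nu ** t)                          # Eq. (34)
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)   # Eq. (35)
    return theta, m, u

# usage
theta, m, u = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, u = nadam_step(theta, m, u, grad=2.0 * theta, t=t)
```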
3.8. AMSGrad
Another exponential moving average variant is AMSGrad [23]. The purpose of developing AMSGrad is to guarantee convergence while preserving the benefits of Adam and RMSProp. In the AMSGrad algorithm, the first and second moment variables are updated as in Equations 36 and 37. The key difference between AMSGrad and Adam is shown in Equation 38: AMSGrad maintains the maximum of all u_t up to the present time step and uses this maximum value for normalizing the running average of the gradient, instead of u_t as in Adam. By doing this, AMSGrad results in a non-increasing step size. Finally, the parameters are updated as in Equation 39, where Û_t denotes diag(û_t).
m_t \leftarrow \rho_{1t} m_{t-1} + (1 - \rho_{1t}) g_t    (36)

u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2) g_t^2    (37)

\hat{u}_t \leftarrow \max(\hat{u}_{t-1}, u_t)    (38)

\theta_{t+1} \leftarrow \Pi_{F, \sqrt{\hat{U}_t}}\left(\theta_t - \epsilon_t m_t / \sqrt{\hat{u}_t}\right)    (39)
On the other hand, in contrast to Adam and AdaGrad, AMSGrad neither increases nor decreases the learning rate and, furthermore, decreases u_t, which can potentially lead to a non-decreasing learning rate even if the gradient is large in future iterations [23].
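A hedged NumPy sketch of Equations 36-39 is given below. The projection onto the feasible set F in Equation 39 is omitted (treated as the identity), the per-step rates ρ_{1t} and ε_t are kept constant, and the small δ is added only for numerical safety; names and defaults are illustrative.

```python
import numpy as np

def amsgrad_step(theta, m, u, u_max, grad, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGrad sketch (Eqs. 36-39): like Adam but normalized by the running maximum
    of the second-moment estimate, with the feasible-set projection omitted."""
    m = rho1 * m + (1 - rho1) * grad                    # Eq. (36)
    u = rho2 * u + (1 - rho2) * grad * grad             # Eq. (37)
    u_max = np.maximum(u_max, u)                        # Eq. (38): keep the max of all u_t
    theta = theta - lr * m / (np.sqrt(u_max) + delta)   # Eq. (39) without projection
    return theta, m, u, u_max

# usage
theta, m, u, u_max = np.ones(3), np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(100):
    theta, m, u, u_max = amsgrad_step(theta, m, u, u_max, grad=2.0 * theta)
```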
4. Experiments
4.1. Datasets
The performance of the optimization algorithms is evaluated on four image datasets, namely MNIST [17], CIFAR-10 [14], Kaggle Flowers [19] and Labeled Faces in the Wild (LFW) [11]. The well-known MNIST dataset includes 60000 training and 10000 test examples, each of which is a 28×28 gray-scale handwritten digit image. CIFAR-10 is composed of 50000 training and 10000 test examples; they are 32×32 color images belonging to 10 classes. Kaggle Flowers contains 4242 images of 5 different types of flowers. LFW includes 13233 face images belonging to 5749 people. In the supervised learning experiments, LFW classes that have at least 30 images are chosen; 34 classes meet this requirement, thus 1777 face images belonging to 34 people are used for classification. On the other hand, all 13233 images are used in the unsupervised learning experiments. Besides, LFW and Kaggle Flowers include color images of varying sizes. In this study, LFW and Kaggle Flowers images are scaled to a size of 64×64 and 96×96, respectively. Also, they are randomly divided into two subsets, 0.75 for training and 0.25 for testing.
4.2. Experimental setting
In this study, both supervised and unsupervised learning tasks are handled. Firstly, three different CNN architectures are used for the classification task. The first includes three convolutional and two fully-connected (FC) layers, similar to the well-known LeNet-5 [17]. The second has five convolutional and three FC layers, organized in an architecture similar to AlexNet [15]. The last has seven convolutional layers stacked VGG-style [27]. These architectures are shown in Table 1. All of them have convolutional layers with 3×3 kernels in addition to 2×2 max-pooling layers. In order to avoid overfitting, weight decay of 1e-4 is applied to the convolutional layers. Additionally, dropout with ratio 0.50 is applied to the FC layers preceding the output layer. All layers use ReLU as the activation function except the output layer. The loss function is categorical cross entropy.
Secondly, convolutional autoencoders (CAEs) are preferred for the unsupervised learning task. In general, a CAE comprises two parts, called the encoder and the decoder. While the encoder generates a representation of the input data, the decoder takes this representation as input and reconstructs it in the output. When deciding on the architecture of the autoencoder, the size of the representation is very important, because as the size of the representation becomes smaller, the reconstruction of the images becomes harder. Therefore, two architectures are used for the unsupervised learning experiments, as described in Table 2. The encoder and decoder have symmetric convolutional-deconvolutional layers; the difference is that upsampling layers are used in the decoder instead of max-pooling layers. The first architecture, CAE-1, reduces the size of the representation to one quarter while the second one, CAE-2, reduces it to one half. For example, the representation of the 64×64 images of LFW is 1024-dimensional in CAE-1 and 2048-dimensional in CAE-2. For the unsupervised learning experiments, the reconstruction loss is mean squared error and the activation function is tanh in the output layer. The rest of the layers use ReLU.
Table 1. CNN architectures used in supervised learning experiments.

CNN-1: Conv-32, MaxPool | Conv-64, MaxPool | Conv-128, MaxPool | FC-128, FC-softmax
CNN-2: Conv-32, MaxPool | Conv-64, MaxPool | Conv-128, Conv-128, Conv-128, MaxPool | FC-128, FC-256, FC-softmax
CNN-3: Conv-32, Conv-32, MaxPool | Conv-64, Conv-64, MaxPool | Conv-128, Conv-128, Conv-128, MaxPool | FC-128, FC-256, FC-softmax
Table 2. CAE architectures used in unsupervised learning experiments.

CAE-1: Conv-32, MaxPool, Conv-4, MaxPool | Conv-4, UpSample, Conv-32, UpSample, Conv-tanh
CAE-2: Conv-32, MaxPool, Conv-8, MaxPool | Conv-8, UpSample, Conv-32, UpSample, Conv-tanh
In all experiments, the learning rate is 0.01 for SGD and its momentum variants, and 0.001 for the algorithms with adaptive learning rates. The same learning rate is used for all adaptive algorithms; SGD and its momentum variants use the larger value because they do not perform well with a smaller learning rate. In other words, the hyperparameters with which the algorithms give the best results on test data are preferred. The momentum α is 0.9 for the SGD variants. The decay constant ρ is 0.95 for AdaDelta. Also, the decay constants ρ_1 and ρ_2 are 0.9 and 0.999 for Adam, AdaMax, Nadam and AMSGrad. The minibatch size is 128 for all models and each of them is trained for 50 epochs. In order to compare the different algorithms, the same parameter initialization is used. The software is based on Theano [2] and runs on a single GTX 1050 Ti GPU.
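For convenience, the settings listed above can be gathered in one place. The dictionary below merely restates the values given in this section; the key names are illustrative and it is not configuration code from the paper.

```python
# Settings as reported in Section 4.2 (illustrative names, values from the text).
optimizer_settings = {
    "SGD":          {"lr": 0.01},
    "SGD-momentum": {"lr": 0.01, "alpha": 0.9},
    "SGD-Nesterov": {"lr": 0.01, "alpha": 0.9},
    "AdaGrad":      {"lr": 0.001},
    "AdaDelta":     {"lr": 0.001, "rho": 0.95},
    "RMSProp":      {"lr": 0.001},
    "Adam":         {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "AdaMax":       {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "Nadam":        {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "AMSGrad":      {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
}
training_settings = {"minibatch_size": 128, "epochs": 50,
                     "conv_weight_decay": 1e-4, "fc_dropout": 0.5}
```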
4.3. Supervised learning experiments
In order to compare the optimization algorithms, all datasets are classified with each CNN architecture. The classification results for the basic dataset MNIST are given in Table 3. For all architectures, AdaDelta and AdaMax give the best accuracies on test data. Additionally, SGD with Nesterov momentum follows AdaDelta and AdaMax very closely for CNN-3. The worst performance belongs to AdaGrad, the simplest adaptive learning algorithm. The behaviour of the algorithms during training is shown in Figure 1. SGD begins training with the highest loss for all three architectures while Nadam begins with the lowest. Therefore, it can be said that Nadam makes a better start to training under the same parameter initialization. Adam and its variants give similar results to each other for CNN-1; the difference between their performances appears when the network becomes deeper. Besides, AdaDelta performs better than AdaGrad and RMSProp during training.
Table 3. Comparison of the algorithms on MNIST for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  0.0468     99.00   |  0.0724     99.03   |  0.0780     99.24
SGD - momentum   |  0.0395     99.26   |  0.0633     99.18   |  0.0718     99.26
SGD - Nesterov   |  0.0366     99.34   |  0.0511     99.35   |  0.0589     99.41
AdaGrad          |  0.0600     98.57   |  0.0753     98.79   |  0.0730     99.08
AdaDelta         |  0.0306     99.50   |  0.0395     99.49   |  0.0460     99.40
RMSProp          |  0.0505     99.26   |  0.1899     98.70   |  0.1108     99.32
Adam             |  0.0425     99.35   |  0.0597     99.00   |  0.0536     99.29
AdaMax           |  0.0337     99.37   |  0.0418     99.51   |  0.0454     99.42
Nadam            |  0.0364     99.32   |  0.0567     99.19   |  0.0565     99.16
AMSGrad          |  0.0401     99.24   |  0.0561     99.28   |  0.0497     99.36
Fig. 1. The behaviour of algorithms on MNIST during training for three CNN architectures.
The classification results for CIFAR-10 are given in Table 4. As CIFAR-10 includes color images at higher resolution than MNIST, the performance of the algorithms begins to change. AdaMax gives the best accuracies on test data for all architectures. AMSGrad performs close to AdaMax for CNN-1 and CNN-2. On the other hand, Adam and SGD with Nesterov momentum follow AdaMax when the neural network becomes deeper. Similar to the MNIST results, AdaGrad cannot perform well on CIFAR-10. The behaviour of the algorithms during training is shown in Figure 2. While SGD begins training with the highest loss, Adam begins with the lowest. Also, AdaMax seems to perform better for deeper architectures.
Table 4. Comparison of the algorithms on CIFAR-10 for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  0.9428     67.98   |  0.8899     70.84   |  1.0112     68.19
SGD - momentum   |  1.2479     74.96   |  1.3368     76.74   |  1.1978     78.21
SGD - Nesterov   |  1.2235     76.01   |  1.2909     77.96   |  1.1651     79.97
AdaGrad          |  1.2428     56.62   |  1.1769     59.61   |  1.1531     60.55
AdaDelta         |  1.6619     74.08   |  1.7222     76.23   |  1.4789     78.55
RMSProp          |  1.2670     75.01   |  1.1354     76.14   |  0.9821     74.81
Adam             |  1.2279     76.05   |  1.1988     76.94   |  0.9393     79.03
AdaMax           |  0.8054     76.89   |  1.1240     79.17   |  1.0437     81.41
Nadam            |  1.2735     76.11   |  1.3102     76.60   |  1.0672     78.98
AMSGrad          |  1.1940     76.45   |  1.2000     77.11   |  1.0135     78.27
Fig. 2. The behaviour of algorithms on CIFAR-10 during training for three CNN architectures.
So far, the algorithms perform well except for SGD and AdaGrad, and the change of depth in the neural network architectures does not significantly affect the performance ranking of the algorithms. Therefore, the algorithms are next compared on a more complicated dataset than MNIST and CIFAR-10. LFW consists of face images, each of which is a 64×64 color image belonging to one of 34 people. It includes more classes than MNIST and CIFAR-10 as well as higher resolutions. Also, it has a small number of samples per class. This classification task is more difficult for the optimization algorithms. The classification results for LFW are given in Table 5. The most obvious result is the decrease in SGD performance: it learns almost nothing during training, and accordingly its accuracy on test data is the lowest. Similarly, AdaGrad also cannot perform well. The best accuracies on test data are obtained by Adam and RMSProp for CNN-1, AdaDelta for CNN-2 and Adam for CNN-3. In the previous experiments, the SGD momentum variants compete with the adaptive algorithms while training the neural networks on MNIST and CIFAR-10. However, while training the neural networks on LFW, the difference between them appears and the superiority of the adaptive learning algorithms becomes clear. Especially AdaDelta, Adam and its variants perform much better.
Table 5. Comparison of the algorithms on LFW for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  2.8016     22.76   |  3.1141     19.22   |  3.1152     19.22
SGD - momentum   |  1.0460     74.70   |  1.1003     71.66   |  1.0803     73.18
SGD - Nesterov   |  0.9470     78.24   |  0.8571     79.25   |  0.9930     75.37
AdaGrad          |  1.6427     58.17   |  1.8106     52.44   |  1.7973     52.95
AdaDelta         |  0.8116     82.29   |  0.6383     84.31   |  0.8417     80.10
RMSProp          |  0.7532     84.48   |  0.7469     81.78   |  0.9060     75.54
Adam             |  0.6887     84.48   |  0.8219     79.25   |  0.7223     82.96
AdaMax           |  0.8734     77.74   |  0.9822     74.70   |  0.8617     78.58
Nadam            |  0.7637     83.81   |  0.7172     82.96   |  0.8015     80.26
AMSGrad          |  0.6795     82.63   |  0.7648     82.12   |  0.6726     81.78
Fig. 3. The behaviour of algorithms on LFW during training for three CNN architectures.
The behaviour of the algorithms during training is shown in Figure 3. While SGD begins training with the highest loss for CNN-1, RMSProp begins with the highest for CNN-2 and CNN-3. On the other hand, AdaDelta begins training with a high training loss but ends up with one of the lowest, giving a good accuracy on test data. As the network becomes deeper, RMSProp becomes worse in both training loss and test accuracy. Also, RMSProp sometimes has sudden peaks during training; this behaviour is more obvious for the deeper networks. While the training losses of Adam and AMSGrad are very close to each other, AdaMax falls behind its variants. In general, AdaDelta, Adam, Nadam and AMSGrad perform significantly better during training. This result can be easily seen especially when the depth of the neural network increases.
Lastly, the performance of the algorithms is compared on Kaggle Flowers, which contains 96×96 images. The classification results for Kaggle Flowers are reported in Table 6 and the training process is shown in Figure 4. AdaMax and AMSGrad come into prominence according to the accuracies on test data. On the contrary, SGD gives the worst results. RMSProp begins the training with the highest loss; however, the initial losses become closer for all algorithms when the neural network becomes deeper. Also, the performance of Nadam decreases when the depth increases. This behaviour of Nadam may arise because of the high resolution of the images. AdaDelta, Nadam and RMSProp have sudden peaks during training. On the other hand, AdaDelta is not so successful on test data even though it performs well during training. Interestingly, AdaGrad performs well, especially for CNN-3: its accuracy on test data is the closest result to the best, which is achieved by AdaMax. Therefore, AdaGrad seems to perform well for deeper networks as well as input images with high resolution.
Table 6. Comparison of the algorithms on Kaggle Flowers for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  1.0115     59.94   |  1.0764     59.66   |  1.3027     44.40
SGD - momentum   |  1.5339     67.99   |  1.3234     69.19   |  1.5486     63.64
SGD - Nesterov   |  1.5711     68.54   |  1.4873     68.54   |  1.7025     64.56
AdaGrad          |  0.8664     67.25   |  0.8639     66.60   |  0.9359     66.88
AdaDelta         |  1.7850     68.64   |  1.8371     65.58   |  2.1974     61.79
RMSProp          |  1.9709     67.99   |  1.9051     66.88   |  1.9653     63.55
Adam             |  1.9147     69.93   |  1.4274     68.82   |  1.8564     62.16
AdaMax           |  1.2073     69.93   |  1.1429     71.32   |  1.0537     70.30
Nadam            |  1.7687     67.06   |  1.9668     67.06   |  1.4252     64.56
AMSGrad          |  1.8220     70.67   |  1.6430     71.23   |  1.9611     61.88
In the supervised learning experiments, the positive effect of momentum on SGD is conspicuous, especially for the CIFAR-10, LFW and Kaggle Flowers datasets. Also, Nesterov momentum is slightly better, even though the momentum and Nesterov momentum results are close at the end of training. In general, Adam, AdaMax and AMSGrad perform well on test data in addition to AdaDelta. Also, the most important advantage of AdaDelta is that it does not require selecting the learning rate manually.
Fig. 4. The behaviour of algorithms on Kaggle Flowers during training for three CNN architectures.
Additionally, the algorithms are compared based on the training time they require. As LFW includes a small number of training examples in the supervised learning experiments, there is no significant difference between the training times of the algorithms for each CNN architecture. On the other hand, MNIST and CIFAR-10 have more training examples than the other two datasets; therefore, the training time varies across the algorithms, as shown in Figure 5. For CNN-1, RMSProp is the fastest among the adaptive algorithms for both datasets. While AdaDelta and AdaMax are slow for MNIST, AMSGrad and Nadam require more time for CIFAR-10. When the dataset becomes more complex, Adam needs more time. AdaGrad is fast but cannot generalize well on test data. Also, SGD and its momentum variants seem to be fast, but their learning rate is larger and they have fewer computational steps; nevertheless, they cannot generalize well, similar to AdaGrad.
RMSProp, Nadam and AdaMax require more time on CIFAR-10 for CNN-2. However, the complexity of the dataset does not matter for Adam and AMSGrad. When the neural network becomes deeper, in CNN-3, Nadam and AdaDelta take more time in comparison with the others for both datasets. In order to train CNN-1 and CNN-2, the difference between the slowest and fastest algorithms is less than one minute; for CNN-3, this difference increases to nearly 5 minutes. AdaMax takes the minimum time among the Adam variants for the deepest CNN. Additionally, the training times differ when CNN-3 is trained on Kaggle Flowers. As seen in Figure 5, Nadam and AdaDelta take more time, while RMSProp requires less time on Kaggle Flowers in comparison with the other adaptive algorithms.
Fig. 5. (Top) MNIST and CIFAR-10 training time for CNN-1 and CNN-2. (Bottom left) MNIST training time for CNN-3. (Bottom right) Kaggle Flowers training time for CNN-3.
4.4. Unsupervised learning experiments

When a CAE is used, the task is to reconstruct the input images in the output layer. When a denoising CAE is used, noisy images are taken as input and the task is to reconstruct clean images in the output layer. Therefore, a Gaussian noise matrix is applied to the input images before using the denoising CAE; in these experiments, the noise factor is 0.25. Both vanilla and denoising autoencoder results on MNIST are given in Table 7. While Adam gives the lowest reconstruction loss for the CAE-1 architecture, AdaMax and Nadam give the lowest for CAE-2, although the performances of Adam and its variants are very close to each other. On the other hand, Nadam gives the best results for the denoising CAE. RMSProp performs well on the denoising task in addition to the Adam variants. The performance of AdaMax slightly decreases for the denoising task. Besides, AdaGrad falls behind the other adaptive algorithms in both tasks. The training loss for MNIST is shown in Figure 6; it can be seen that Adam and AMSGrad begin training with the lowest loss.
The results on CIFAR-10, LFW and Kaggle Flowers are similar: Adam and its variants give the best results. Different from the MNIST results, RMSProp begins training with the lowest reconstruction loss; however, Adam and its variants give better results at the end of training. The results on Kaggle Flowers are interesting. For CAE-1, the performance of Adam and AMSGrad gets worse in the denoising task. Accordingly, it can be said that when the size of the representation is small, Adam and AMSGrad cannot perform well on high-resolution images. Also, RMSProp performs well together with Adam and its variants on Kaggle Flowers for both unsupervised tasks. The vanilla and denoising autoencoder results are given in Table 8 for CIFAR-10, Table 9 for LFW and Table 10 for Kaggle Flowers, respectively. Also, the behaviour of the optimization algorithms during training for the unsupervised learning tasks is shown in Figure 7 for CIFAR-10, Figure 8 for LFW and Figure 9 for Kaggle Flowers.
Table 7. Comparison of the algorithms on MNIST for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0175      0.0160   |  0.0219      0.0186
SGD - momentum   |  0.0095      0.0074   |  0.0120      0.0103
SGD - Nesterov   |  0.0094      0.0074   |  0.0120      0.0102
AdaGrad          |  0.0113      0.0084   |  0.0140      0.0125
AdaDelta         |  0.0080      0.0053   |  0.0105      0.0077
RMSProp          |  0.0065      0.0044   |  0.0087      0.0067
Adam             |  0.0053      0.0038   |  0.0082      0.0067
AdaMax           |  0.0055      0.0036   |  0.0088      0.0069
Nadam            |  0.0056      0.0036   |  0.0080      0.0066
AMSGrad          |  0.0054      0.0038   |  0.0083      0.0068
Fig. 6. The behaviour of algorithms on MNIST during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Table 8. Comparison of the algorithms on CIFAR-10 for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0145      0.0152   |  0.0189      0.0158
SGD - momentum   |  0.0093      0.0091   |  0.0119      0.0108
SGD - Nesterov   |  0.0093      0.0089   |  0.0119      0.0109
AdaGrad          |  0.0115      0.0067   |  0.0167      0.0113
AdaDelta         |  0.0093      0.0088   |  0.0113      0.0094
RMSProp          |  0.0072      0.0059   |  0.0102      0.0083
Adam             |  0.0055      0.0047   |  0.0096      0.0075
AdaMax           |  0.0061      0.0047   |  0.0087      0.0074
Nadam            |  0.0069      0.0046   |  0.0087      0.0073
AMSGrad          |  0.0060      0.0049   |  0.0098      0.0075

Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
SGD is by far the worst for both unsupervised tasks on all datasets. In general, Adam and its variants seem to be much better than the other adaptive gradient methods, as well as than SGD and its momentum variants. AdaGrad cannot perform well among the adaptive algorithms.
Table 9. Comparison of the algorithms on LFW for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0122      0.0136   |  0.0125      0.0094
SGD - momentum   |  0.0070      0.0077   |  0.0073      0.0063
SGD - Nesterov   |  0.0070      0.0077   |  0.0071      0.0063
AdaGrad          |  0.0071      0.0046   |  0.0090      0.0069
AdaDelta         |  0.0061      0.0055   |  0.0086      0.0066
RMSProp          |  0.0044      0.0033   |  0.0062      0.0053
Adam             |  0.0030      0.0018   |  0.0044      0.0039
AdaMax           |  0.0033      0.0019   |  0.0052      0.0043
Nadam            |  0.0036      0.0025   |  0.0050      0.0045
AMSGrad          |  0.0030      0.0018   |  0.0049      0.0041

Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
In order to compare the results visually, the reconstructions of test images obtained by all algorithms for the CAE-1 architecture are shown in Figure 10. As handwritten digit images are simpler than the other datasets, all algorithms perform well on MNIST; even though the reconstructions of SGD are a little blurry, the digits can be seen clearly. An important point concerns the Kaggle Flowers results: SGD and AdaMax cannot reconstruct the colors properly. Similar results can be seen for CAE-2, as shown in Figure 11. Here, it seems that all adaptive algorithms can reconstruct images with their colors when the size of the representation increases.
Table 10. Comparison of the algorithms on Kaggle Flowers for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0359      0.0385   |  0.0367      0.0347
SGD - momentum   |  0.0217      0.0220   |  0.0320      0.0217
SGD - Nesterov   |  0.0217      0.0214   |  0.0230      0.0219
AdaGrad          |  0.0298      0.0179   |  0.0254      0.0234
AdaDelta         |  0.0220      0.0180   |  0.0236      0.0203
RMSProp          |  0.0202      0.0159   |  0.0216      0.0179
Adam             |  0.0144      0.0121   |  0.0269      0.0148
AdaMax           |  0.0205      0.0133   |  0.0209      0.0164
Nadam            |  0.0170      0.0138   |  0.0208      0.0166
AMSGrad          |  0.0145      0.0122   |  0.0270      0.0146

Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
For the denoising task, the reconstructions for all datasets with CAE-1 are given in Figure 12. The small size of the representation makes this task more difficult. None of the algorithms can reconstruct the colors properly on Kaggle Flowers. Also, as it is hard to reconstruct faces and objects, the results are blurry. However, as the representation size increases, the colors given by especially Adam and its variants are much better, as shown in Figure 13. Also, while SGD and AdaGrad can reconstruct images with their colors on CIFAR-10 for CAE-1, both algorithms fail with the denoising CAE-1.
Fig. 10. Reconstructions of test images using CAE-1.
Fig. 11. Reconstructions of test images using CAE-2.
Fig. 12. Reconstructions of test images using denoising CAE-1.
Fig. 13. Reconstructions of test images using denoising CAE-2.
Lastly, the algorithms are compared based on the training time they require. Similar to the supervised learning experiments, training time does not vary significantly on LFW and Kaggle Flowers. The training times on MNIST and CIFAR-10 for the unsupervised learning experiments are shown in Figure 14. In general, Adam and Nadam need more time than the other algorithms to reconstruct images and, similar to the supervised results, they need more time when the dataset becomes more complex. Also, AdaMax requires more time on CIFAR-10, as especially seen for CAE-1. All algorithms need more time for CAE-2 relative to CAE-1. Also, the denoising task mostly causes the training time to decrease on CIFAR-10 and increase on MNIST. Even though SGD and its momentum variants train the autoencoders quickly at the learning rate where they perform best, they cannot generalize well on test data.
Fig. 14. (Top) MNIST and CIFAR-10 training time for CAE-1 and CAE-2. (Bottom) MNIST and CIFAR-10 training time for denoising CAE-1 and CAE-2.
5. Conclusion

In this study, the most commonly used optimization algorithms in deep learning are examined. The differences between their working principles are summarized, considering the pros and cons of each. In this context, the importance of adaptive learning algorithms is highlighted. The performance of the algorithms is compared empirically on four image datasets, and the behaviour of the algorithms during training is observed with respect to the effects of different image resolutions and neural network architectures. Because the adaptive methods are mostly superior and computationally efficient, their results are better for both tasks. Still, the research to find better adaptive methods for deep learning continues.
References
1. E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2014.
2. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley and Y. Bengio, Theano: A CPU and GPU math compiler in Python, in Proc. 9th Python in Science Conference, Austin, Texas, 2010, pp. 3-10.
3. L. Bottou, F.E. Curtis and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018) 223-311.
4. Y. Dauphin, H. De Vries, J. Chung and Y. Bengio, RMSProp and equilibrated adaptive learning rates for non-convex optimization, CoRR abs/1502.04390, 2015.
5. T. Dozat, Incorporating Nesterov momentum into Adam, International Conference on Learning Representations, San Juan, Puerto Rico, 1 (2016) 2013-2016.
6. J. Duchi, E. Hazan and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12 (2011) 2121-2159.
7. A. Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, O'Reilly, 2017.
8. I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, The MIT Press, 2016.
9. H. Hayashi, J. Koushik and G. Neubig, Eve: A gradient based optimization method with locally and globally adaptive learning rates, arXiv preprint arXiv:1611.01505, 2016.
10. G. Hinton, Neural networks for machine learning, Coursera, video lectures, 2012.
11. G.B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
12. H. Huang, C. Wang and B. Dong, Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557, 2018.
13. D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
14. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009.
15. A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 25 (2012) 1097-1105.
16. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng, On optimization methods for deep learning, in Proc. 28th International Conference on Machine Learning, Bellevue, Washington, USA, 2011, pp. 265-272.
17. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11) (1998) 2278-2324.
18. L. Luo, Y. Xiong, Y. Liu and X. Sun, Adaptive gradient methods with dynamic bound of learning rate, International Conference on Learning Representations, New Orleans, 2019.
19. A. Mamaev, Flowers Recognition, https://www.kaggle.com/alxmamaev/flowers-recognition, 2018.
20. M.C. Mukkamala and M. Hein, Variants of RMSProp and Adagrad with logarithmic regret bounds, in Proc. 34th International Conference on Machine Learning, Sydney, Australia, 70 (2017) 2545-2553.
21. Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, 27 (1983) 372-376.
22. B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4(5) (1964) 1-17.
23. S.J. Reddi, S. Kale and S. Kumar, On the convergence of Adam and beyond, International Conference on Learning Representations, Vancouver, Canada, 2018.
24. H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22 (1951) 400-407.
25. S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.
26. N. Shazeer and M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, arXiv preprint arXiv:1804.04235, 2018.
27. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations, San Diego, CA, USA, 2015.
28. D. Soham, M. Anirbit and U. Enayat, Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration, arXiv preprint arXiv:1807.06766, 2018.
29. I. Sutskever, J. Martens, G. Dahl and G. Hinton, On the importance of initialization and momentum in deep learning, International Conference on Machine Learning, Atlanta, USA, 2013, pp. 1139-1147.
30. M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar, Adaptive methods for nonconvex optimization, Advances in Neural Information Processing Systems, 31 (2018) 9793-9803.
31. M.D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv preprint arXiv:1212.5701, 2012.
Derya Soydaner received her B.Sc., M.Sc. and Ph.D. degrees in statistics from Mimar Sinan Fine Arts University, Istanbul, Turkey in 2012, 2014 and 2018, respectively. She mainly studied neural networks throughout her postgraduate education. Currently, she is a faculty member in the Department of Statistics at Mimar Sinan Fine Arts University. Her research interests include pattern recognition, machine learning and image processing.