arXiv:2007.14166v1 [cs.LG] 28 Jul 2020
International Journal of Pattern Recognition and Artificial Intelligence © World Scientific Publishing Company
A Comparison of Optimization Algorithms for Deep Learning
Derya Soydaner
Statistics Department, Mimar Sinan Fine Arts University, İstanbul, 34380, Turkey
derya.soydaner@msgsu.edu.tr
In recent years, we have witnessed the rise of deep learning. Deep neural networks have proved their success in many areas. However, the optimization of these networks has become more difficult as neural networks grow deeper and datasets become bigger. Therefore, more advanced optimization algorithms have been proposed over the past years. In this study, widely used optimization algorithms for deep learning are examined in detail. To this end, these algorithms, called adaptive gradient methods, are implemented for both supervised and unsupervised tasks. The behaviour of the algorithms during training and their results on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers and Labeled Faces in the Wild, are compared by pointing out their differences against basic optimization algorithms.
Keywords: Adaptive gradient methods; optimization; deep learning; image processing.
1. Introduction

Adaptive gradient methods have been widely used in deep learning. Although stochastic gradient descent (SGD) has been one of the most preferred algorithms for many years, it struggles with serious problems such as ill-conditioning and the time required for large-scale datasets when training deep neural networks. It requires manual tuning of the learning rate and is difficult to parallelize [16]. Thus, the problems of SGD led to the invention of more advanced algorithms. Nowadays, the optimization algorithms used for deep learning adapt their learning rates during training. Basically, the adaptive gradient methods adjust the learning rate for each parameter: when the gradients for some parameters are large, their learning rates are reduced, and vice versa.
In recent years, many adaptive methods have been proposed and have become the most commonly used alternatives to SGD. In addition to their high performance when training deep models, another advantage is that they are first-order optimization algorithms, just as SGD is. Thus, they are computationally efficient for training deep neural networks. This work aims to present the most widely used adaptive optimization algorithms that have proven their superiority and to compare their working principles. To this end, image processing, one of the most important application areas of deep learning, is considered. Firstly, the effects of adaptive gradient methods are observed for the image classification task by using convolutional neural networks (CNNs). Secondly,
as an unsupervised task, convolutional autoencoders (CAEs), one of the quintessential examples of unsupervised learning, are used for image reconstruction. Besides, the effects of the algorithms are examined by using denoising autoencoders. In this way, the behaviours of the algorithms during training are analyzed in addition to their performance on both supervised and unsupervised learning tasks.
The rest of the paper is organized as follows. In Section 2, studies about adaptive gradient methods and their most recent variants are reviewed briefly. In Section 3, widely used optimization algorithms in deep learning are explained by showing their update rules and their solutions to the challenges of training deep networks. SGD and its momentum variants are also mentioned in this section. The experiments and comparative results are presented in Section 4, and the conclusion in Section 5.
2. Related Work
In the deep learning literature, the working principles and performance of optimization algorithms are widely studied. For example, theoretical guarantees of convergence to criticality for RMSProp and Adam are presented in the setting of optimizing a non-convex objective [28]. The authors design experiments to empirically study the convergence and generalization properties of RMSProp and Adam against Nesterov's accelerated gradient method. In another study, conjugate gradient, SGD and limited memory BFGS algorithms are compared [16]. A review of numerical optimization algorithms in the context of machine learning applications is presented in [3]. Additionally, similar to this work, an overview of gradient-based optimization algorithms is given in [25].
In this study, the most widely used optimization algorithms are examined in the context of deep learning. Meanwhile, new variants of adaptive methods continue to be proposed. For example, new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, are proposed in [18]. They employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD. Also, a new algorithm that adapts the learning rate locally for each parameter separately, and also globally for all parameters together, is presented in [9]. Another new algorithm, called Nostalgic Adam (NosAdam), which places bigger weights on the past gradients than on the recent gradients when designing the adaptive learning rate, is introduced in [12]. In another study, two variants called SC-Adagrad and SC-RMSProp are proposed [20]. A new adaptive optimization algorithm called YOGI, which controls the increase in the effective learning rate, is presented in [30]. A novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner is developed in [4]. Also, a new algorithm called Adafactor is presented in [26]. Instead of scaling parameter updates by the inverse square roots of exponential moving averages of squared past gradients, Adafactor maintains only the per-row and per-column sums of the moving averages, and estimates the per-parameter second moments based on these sums.
3. Optimization Algorithms with Adaptive Learning Rates
The choice of the algorithm used to optimize a neural network is one of the most important steps. In machine learning, there are three main kinds of optimization methods. The first, called batch or deterministic gradient methods, processes all training examples simultaneously in one large batch. The second, called stochastic or online methods, uses only one example at a time. Today, most algorithms are a blend of the two: during training, they use only a part of the training set at each update. These algorithms are called minibatch methods. In the deep learning era, minibatch methods are mostly preferred for two major reasons. Firstly, they accelerate the training of neural networks. Secondly, as the minibatches are selected randomly and are independent, an unbiased estimate of the expected gradient can be computed [8].
In this paper, the most widely used minibatch-based adaptive algorithms are examined in detail. Besides, SGD, which was conventionally preferred for a long time, is briefly explained alongside its momentum variants.
3.1. Stochastic gradient descent
Basically, SGD [24] follows the gradient of randomly selected minibatches downhill. In order to train a neural network using SGD, firstly, the gradient estimate is computed by using a loss function. Then, the update at iteration k is applied to the parameters θ. The calculations for each minibatch of m examples from the training set x^(1), ..., x^(m) with corresponding targets y^(i) are as follows:
\hat{g} \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})    (1)

\theta \leftarrow \theta - \epsilon_k \hat{g}    (2)
Here, the learning rate ε_k is a very important hyperparameter. The magnitude of the update depends on the learning rate. If it is too large, updates depend too much on recent instances. If it is too small, many updates may be needed for convergence [1]. This hyperparameter can be chosen by trial and error. One way is to choose the one of several learning rates that results in the smallest loss function value; this is called line search. Another way is to monitor the first several epochs and use a learning rate that is higher than the best-performing learning rate. In Equation 2, the learning rate is denoted as ε_k at iteration k because, in practice, it is necessary to gradually decrease the learning rate over time [8].
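For illustration, the update in Equations 1 and 2 can be written in a few lines of NumPy. This is only a minimal sketch with a toy least-squares loss, a fixed learning rate and invented function names; it is not the implementation used in the paper.

```python
import numpy as np

def sgd_step(theta, grad_fn, x_batch, y_batch, lr=0.01):
    """One SGD update (Eqs. 1-2): theta <- theta - lr * minibatch gradient."""
    g_hat = grad_fn(theta, x_batch, y_batch)   # gradient averaged over the minibatch
    return theta - lr * g_hat

def lsq_grad(theta, x, y):
    """Gradient of the toy loss L = ||x @ theta - y||^2 / (2m)."""
    return x.T @ (x @ theta - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 5))
theta_true = np.arange(5.0)
y = x @ theta_true
theta = np.zeros(5)
for _ in range(1000):
    theta = sgd_step(theta, lsq_grad, x, y, lr=0.1)
# theta converges toward theta_true on this toy problem
```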
3.1.1. Stochastic gradient descent with momentum
SGD has difficulty reaching the global optimum because of its tendency to oscillate, especially on steep surface curves. Noisy or small gradients may also be problematic. The method of momentum [22] is designed to accelerate learning in such cases. It aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. The idea behind this algorithm is to take a running average by incorporating the previous update in the current change, as if there is a momentum due to previous updates [1]. When SGD is used with momentum, it can converge faster as well as with reduced oscillation.
SGD with momentum uses a variable v called velocity. The velocity is the direction and speed at which the parameters move through parameter space. It is set to an exponentially decaying average of the negative gradient. Also, SGD with momentum requires a new hyperparameter α ∈ [0, 1), called the momentum parameter, that determines how quickly the contributions of previous gradients exponentially decay. The parameters are updated after the velocity update is computed:
v \leftarrow \alpha v - \epsilon \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})    (3)

\theta \leftarrow \theta + v    (4)
The velocity v accumulates the gradient elements. The larger α is relative to ε, the more previous gradients affect the current direction. Common values of α used in practice are 0.5, 0.9 and 0.99 [8]. However, a disadvantage of this algorithm is the requirement of the momentum hyperparameter in addition to the learning rate.
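A minimal NumPy sketch of Equations 3 and 4 is given below, assuming the averaged minibatch gradient is passed in as `grad` and the velocity is initialized to zeros; the names and default values are illustrative only.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """SGD with momentum (Eqs. 3-4): v <- alpha*v - lr*grad, theta <- theta + v."""
    v = alpha * v - lr * grad    # Eq. (3): decay the old velocity, add the new gradient step
    theta = theta + v            # Eq. (4)
    return theta, v

# usage: start with zero velocity and feed the averaged minibatch gradient each step
theta, v = np.zeros(5), np.zeros(5)
grad = np.ones(5)                # placeholder gradient
theta, v = momentum_step(theta, v, grad)
```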
3.1.2. Stochastic gradient descent with Nesterov momentum
SGD with Nesterov momentum [29] is proposed as a variant of the standard momentum method, taking inspiration from Nesterov's accelerated gradient method [21]. The idea is to measure the gradient of the loss function not at the local position but slightly ahead in the direction of the momentum [7]. In other words, the gradient is evaluated after the current velocity is applied. Therefore, SGD with Nesterov momentum begins with an interim update for a minibatch [8]:
\tilde{\theta} \leftarrow \theta + \alpha v    (5)
Then, the gradient is computed at the interim point. Using this gradient, the velocity update is computed. Finally, the parameters are updated:
g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_{i=1}^{m} L(f(x^{(i)}; \tilde{\theta}), y^{(i)})    (6)

v \leftarrow \alpha v - \epsilon g    (7)

\theta \leftarrow \theta + v    (8)
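The look-ahead evaluation of Equations 5-8 can be sketched in NumPy as follows; `grad_fn` is a placeholder that returns the averaged minibatch gradient at a given parameter value, and all names and defaults are illustrative rather than taken from the paper.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """SGD with Nesterov momentum (Eqs. 5-8): the gradient is evaluated at the
    interim (look-ahead) point theta + alpha*v rather than at theta itself."""
    theta_interim = theta + alpha * v    # Eq. (5)
    g = grad_fn(theta_interim)           # Eq. (6): minibatch gradient at the interim point
    v = alpha * v - lr * g               # Eq. (7)
    theta = theta + v                    # Eq. (8)
    return theta, v

# usage with a toy quadratic loss 0.5*||theta||^2, whose gradient is theta itself
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
```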
3.2. AdaGrad
One of the optimization algorithms that individually adapt the learning rates of model parameters is AdaGrad [6]. Parameters with the largest partial derivatives of the loss have a rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate [8]. This is achieved by using all the historical squared values of the gradient.
AdaGrad uses an additional variable r for gradient accumulation. At the beginning of the algorithm, the gradient accumulation variable is initialized to zero, and the gradient is computed for a minibatch:
g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})    (9)
Using this gradient, the squared gradient is accumulated. Then, the update is computed by scaling the learning rates of all parameters inversely proportionally to the square root of the sum of all the historical squared values of the gradient. Finally, this update is applied to the model parameters:
r \leftarrow r + g \odot g    (10)

\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g    (11)

\theta \leftarrow \theta + \Delta\theta    (12)
where ε is the global learning rate and δ is a small constant for numerical stability. However, AdaGrad has serious disadvantages. Generally, it performs well for simple quadratic problems, but it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum [7]. Also, for training deep neural networks, the accumulation of squared gradients from the beginning of training can result in an excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models [8].
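The per-parameter scaling of Equations 9-12 can be sketched in NumPy as below; names and default values are illustrative, not from the paper. The ever-growing accumulator r is what causes the early-stopping behaviour described above.

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.001, delta=1e-7):
    """AdaGrad (Eqs. 9-12): accumulate squared gradients in r and scale each
    parameter's step by 1 / (delta + sqrt(r)) element-wise."""
    r = r + grad * grad                          # Eq. (10)
    update = -lr * grad / (delta + np.sqrt(r))   # Eq. (11)
    return theta + update, r                     # Eq. (12)

# usage: r starts at zero and only grows, which eventually stalls learning
theta, r = np.ones(3), np.zeros(3)
theta, r = adagrad_step(theta, r, grad=2.0 * theta)
```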
3.3. AdaDelta
The underlying idea of the AdaDelta algorithm is to remedy the two main drawbacks of AdaGrad: the continual decay of learning rates throughout training and the need for a manually selected global learning rate. To this end, AdaDelta restricts the window of past gradients to some fixed size w instead of accumulating the sum of squared gradients over all time. As mentioned in the previous section, AdaGrad accumulates the squared gradients from each iteration starting at the beginning of training. This accumulated sum continues to grow during training, effectively shrinking the learning rate in each dimension. After many iterations, the learning rate becomes infinitesimally small. With the windowed accumulation, AdaGrad becomes a local estimate using recent gradients instead of accumulating to infinity. Thus, learning continues to make progress even after many iterations of updates have been done [31].
Since storing w previous squared gradients is inefficient, AdaDelta implements this accumulation as an exponentially decaying average of the squared gradients. Assuming this running average is E[g^2]_t at time t, the gradient accumulation is computed as follows:

E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2    (13)

where ρ is a decay constant similar to that used in the momentum method. Since the square root of this quantity is required in the parameter updates, it effectively becomes the root mean square (RMS) of the previous squared gradients up to time t:

RMS[g]_t = \sqrt{E[g^2]_t + \delta}    (14)

where δ is again a small constant. Based on this RMS, the parameter update is computed, the updates are accumulated and the parameters are updated, respectively:

\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t    (15)

E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho) \Delta\theta_t^2    (16)

\theta_{t+1} = \theta_t + \Delta\theta_t    (17)
The advantage of AdaDelta is that it requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architectures, various data modalities and the selection of hyperparameters [31].
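Equations 13-17 can be summarized in a short NumPy sketch; the names and defaults below are illustrative and not taken from the paper. Note that no global learning rate appears anywhere in the update.

```python
import numpy as np

def adadelta_step(theta, eg2, ed2, grad, rho=0.95, delta=1e-6):
    """AdaDelta (Eqs. 13-17). eg2 and ed2 hold the running averages E[g^2] and
    E[dtheta^2]; the step size is the ratio of their RMS values."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                          # Eq. (13)
    update = -(np.sqrt(ed2 + delta) / np.sqrt(eg2 + delta)) * grad   # Eqs. (14)-(15)
    ed2 = rho * ed2 + (1 - rho) * update ** 2                        # Eq. (16)
    return theta + update, eg2, ed2                                  # Eq. (17)

# usage: both accumulators start at zero
theta, eg2, ed2 = np.ones(3), np.zeros(3), np.zeros(3)
theta, eg2, ed2 = adadelta_step(theta, eg2, ed2, grad=2.0 * theta)
```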
3.4. RMSProp
Another algorithm that modifies AdaGrad is RMSProp [10]. It is proposed to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. As mentioned in Section 3.2, AdaGrad shrinks the learning rate according to the entire history of the squared gradient. Instead, RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl [8].
In order to implement RMSProp, the squared gradient is accumulated after the gradient is computed:

r \leftarrow \rho r + (1 - \rho) g \odot g    (18)

where ρ is the decay rate. Then the parameter update is computed and applied as follows:

\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g    (19)

\theta \leftarrow \theta + \Delta\theta    (20)
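A minimal NumPy sketch of Equations 18-20 follows, again with illustrative names and defaults; only the accumulation rule differs from the AdaGrad sketch above.

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp (Eqs. 18-20): exponentially decaying average of squared gradients
    replaces AdaGrad's ever-growing sum."""
    r = rho * r + (1 - rho) * grad * grad        # Eq. (18)
    update = -lr * grad / np.sqrt(delta + r)     # Eq. (19)
    return theta + update, r                     # Eq. (20)

# usage
theta, r = np.ones(3), np.zeros(3)
theta, r = rmsprop_step(theta, r, grad=2.0 * theta)
```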
3.5. Adam
Adam is one of the most widely used optimization algorithms in deep learning. The name Adam is derived from adaptive moment estimation, because it computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. Adam combines the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings [13].
Adam has some important properties. Firstly, momentum is incorporated directly as an estimate of the first-order moment of the gradient. Also, Adam includes bias corrections to the estimates of both the first-order moments and the second-order moments to account for their initialization at the origin [8]. The algorithm updates exponential moving averages of the gradient m_t and the squared gradient u_t, where the hyperparameters ρ_1 and ρ_2 control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the first moment (the mean) and the second raw moment (the uncentered variance) of the gradient [13].
The Adam algorithm requires first and second moment variables m and u. After the gradient is computed, the biased first and second moment estimates are updated at time step t, respectively:

m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1) g_t    (21)

u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2) g_t \odot g_t    (22)

Then, the bias in the first and second moments is corrected. Using the corrected moment estimates, the parameter update is calculated and applied:

\hat{m}_t \leftarrow \frac{m_t}{1 - \rho_1^t}    (23)

\hat{u}_t \leftarrow \frac{u_t}{1 - \rho_2^t}    (24)

\Delta\theta = -\epsilon \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta}    (25)

\theta_t \leftarrow \theta_{t-1} + \Delta\theta    (26)
Adam has many advantages. First of all, it requires little tuning of the learning rate. Also, it is a method that is straightforward to implement and invariant to diagonal rescaling of the gradients. It is computationally efficient and has small memory requirements. Besides, Adam is appropriate for non-stationary objectives and for problems with very noisy and sparse gradients [13].
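As an illustration, Equations 21-26 can be written as a short NumPy routine; this is only a sketch with illustrative names and the common default values, not the implementation used in the experiments.

```python
import numpy as np

def adam_step(theta, m, u, grad, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam (Eqs. 21-26) with bias-corrected moment estimates; t is the 1-based step."""
    m = rho1 * m + (1 - rho1) * grad            # Eq. (21): first moment (mean)
    u = rho2 * u + (1 - rho2) * grad * grad     # Eq. (22): second raw moment
    m_hat = m / (1 - rho1 ** t)                 # Eq. (23): bias correction
    u_hat = u / (1 - rho2 ** t)                 # Eq. (24): bias correction
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)   # Eqs. (25)-(26)
    return theta, m, u

# usage
theta, m, u = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, u = adam_step(theta, m, u, grad=2.0 * theta, t=t)
```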
3.6. AdaMax
AdaMax is proposed as an extension of Adam. It is a variant of Adam based on the infinity norm. In Adam, the update rule for individual weights is to scale their gradients inversely proportionally to an L2 norm of their individual current and past gradients. AdaMax is based on the idea that this L2-norm-based update rule can be generalized to an Lp-norm-based update rule.
The AdaMax algorithm begins by calculating the gradient of the stochastic objective at time step t, as usual. Then, the biased first moment estimate and the exponentially weighted infinity norm are computed. Using them, the model parameters are updated. These steps are defined below, respectively:
g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})    (27)

m_t \leftarrow \rho_1 m_{t-1} + (1 - \rho_1) g_t    (28)

\gamma_t \leftarrow \max(\rho_2 \gamma_{t-1}, |g_t|)    (29)

\theta_t \leftarrow \theta_{t-1} - \frac{\epsilon}{1 - \rho_1^t} \frac{m_t}{\gamma_t}    (30)
It is shown that if AdaMax is preferred as the optimization algorithm, there is no need to correct for initialization bias. Besides, the magnitude of parameter updates has a simpler bound with AdaMax than with Adam [13].
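The following NumPy sketch mirrors Equations 27-30 with illustrative names and defaults; a tiny constant is added in the denominator purely as a safeguard and is not part of the stated update.

```python
import numpy as np

def adamax_step(theta, m, gamma, grad, t, lr=0.001, rho1=0.9, rho2=0.999):
    """AdaMax (Eqs. 27-30): Adam's L2-based accumulator is replaced by an
    exponentially weighted infinity norm, so gamma needs no bias correction."""
    m = rho1 * m + (1 - rho1) * grad                   # Eq. (28)
    gamma = np.maximum(rho2 * gamma, np.abs(grad))     # Eq. (29)
    # the small constant below only avoids 0/0 for components whose gradient has
    # always been exactly zero; it is not part of Eq. (30)
    theta = theta - (lr / (1 - rho1 ** t)) * m / (gamma + 1e-12)
    return theta, m, gamma

# usage
theta, m, gamma = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, gamma = adamax_step(theta, m, gamma, grad=2.0 * theta, t=t)
```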
3.7. Nadam
Nadam (Nesterov-accelerated adaptive moment estimation) modifies Adam's momentum component with Nesterov's accelerated gradient. Thus, Nadam aims to improve the speed of convergence and the quality of the learned models [5].
Similar to Adam, after the gradient is computed, the first and second moment variables are updated as in Equations 31 and 32. Then, the corrected moments are computed and the parameters are updated as in the following equations:
m_t \leftarrow \rho_t m_{t-1} + (1 - \rho_t) g_t    (31)

u_t \leftarrow \nu u_{t-1} + (1 - \nu) g_t^2    (32)

\hat{m}_t \leftarrow \rho_{t+1} m_t / \left(1 - \prod_{i=1}^{t+1} \rho_i\right) + (1 - \rho_t) g_t / \left(1 - \prod_{i=1}^{t} \rho_i\right)    (33)

\hat{u}_t \leftarrow \nu u_t / (1 - \nu^t)    (34)

\theta_t \leftarrow \theta_{t-1} - \epsilon \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \delta}    (35)
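Equations 31-35 allow a time-varying momentum schedule ρ_t. The sketch below assumes a constant schedule ρ_t = ρ, so the products over ρ_i reduce to powers of ρ; names and defaults are illustrative, not from the paper.

```python
import numpy as np

def nadam_step(theta, m, u, grad, t, lr=0.001, rho=0.9, nu=0.999, delta=1e-8):
    """Nadam sketch (Eqs. 31-35), assuming a constant momentum schedule rho_t = rho,
    so the products over rho_i in Eq. (33) become rho**t and rho**(t+1)."""
    m = rho * m + (1 - rho) * grad                          # Eq. (31)
    u = nu * u + (1 - nu) * grad * grad                     # Eq. (32)
    m_hat = (rho * m / (1 - rho ** (t + 1))                 # look-ahead momentum term
             + (1 - rho) * grad / (1 - rho ** t))           # Eq. (33)
    u_hat = nu * u / (1 - nu ** t)                          # Eq. (34)
    theta = theta - lr * m_hat / (np.sqrt(u_hat) + delta)   # Eq. (35)
    return theta, m, u

# usage
theta, m, u = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, u = nadam_step(theta, m, u, grad=2.0 * theta, t=t)
```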
3.8. AMSGrad
Another exponential moving average variant is AMSGrad [23]. The purpose of developing AMSGrad is to guarantee convergence while preserving the benefits of Adam and RMSProp. In the AMSGrad algorithm, the first and second moment variables are updated as in Equations 36 and 37. The key difference between AMSGrad and Adam is shown in Equation 38: AMSGrad maintains the maximum of all u_t up to the present time step and uses this maximum value for normalizing the running average of the gradient, instead of u_t as in Adam. By doing this, AMSGrad results in a non-increasing step size. Finally, the parameters are updated as in Equation 39, where Û_t denotes diag(û_t).
m_t \leftarrow \rho_{1t} m_{t-1} + (1 - \rho_{1t}) g_t    (36)

u_t \leftarrow \rho_2 u_{t-1} + (1 - \rho_2) g_t^2    (37)

\hat{u}_t \leftarrow \max(\hat{u}_{t-1}, u_t)    (38)

\theta_{t+1} \leftarrow \Pi_{F, \sqrt{\hat{U}_t}}\left(\theta_t - \epsilon_t m_t / \sqrt{\hat{u}_t}\right)    (39)
On the other hand, in contrast to Adam and AdaGrad, AMSGrad neither increases nor decreases the learning rate and, furthermore, decreases u_t, which can potentially lead to a non-decreasing learning rate even if the gradient is large in future iterations [23].
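A hedged NumPy sketch of Equations 36-39 is given below. The projection onto the feasible set F in Equation 39 is omitted (treated as the identity), the per-step rates ρ_{1t} and ε_t are kept constant, and the small δ is added only for numerical safety; names and defaults are illustrative.

```python
import numpy as np

def amsgrad_step(theta, m, u, u_max, grad, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """AMSGrad sketch (Eqs. 36-39): like Adam but normalized by the running maximum
    of the second-moment estimate, with the feasible-set projection omitted."""
    m = rho1 * m + (1 - rho1) * grad                    # Eq. (36)
    u = rho2 * u + (1 - rho2) * grad * grad             # Eq. (37)
    u_max = np.maximum(u_max, u)                        # Eq. (38): keep the max of all u_t
    theta = theta - lr * m / (np.sqrt(u_max) + delta)   # Eq. (39) without projection
    return theta, m, u, u_max

# usage
theta, m, u, u_max = np.ones(3), np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(100):
    theta, m, u, u_max = amsgrad_step(theta, m, u, u_max, grad=2.0 * theta)
```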
4. Experiments
4.1. Datasets
The performance of the optimization algorithms is evaluated on four image datasets, namely MNIST [17], CIFAR-10 [14], Kaggle Flowers [19] and Labeled Faces in the Wild (LFW) [11]. The well-known MNIST dataset includes 60000 training and 10000 test examples, each of which is a 28×28 gray-scale handwritten digit image. CIFAR-10 is composed of 50000 training and 10000 test examples; they are 32×32 color images belonging to 10 classes. Kaggle Flowers contains 4242 images of 5 different types of flowers. LFW includes 13233 face images belonging to 5749 people. In the supervised learning experiments, LFW classes that have at least 30 images are chosen; 34 classes meet this requirement, thus 1777 face images belonging to 34 people are used for classification. On the other hand, all 13233 images are used in the unsupervised learning experiments. Besides, LFW and Kaggle Flowers include color images of varying sizes. In this study, LFW and Kaggle Flowers images are scaled to a size of 64×64 and 96×96, respectively. Also, they are randomly divided into two subsets, 0.75 for training and 0.25 for testing.
4.2. Experimental setting
In this study, both supervised and unsupervised learning tasks are handled. Firstly, three different CNN architectures are used for the classification task. The first includes three convolutional and two fully-connected (FC) layers, similar to the well-known LeNet-5 [17]. The second has five convolutional and three FC layers, organized in an architecture similar to AlexNet [15]. The last has seven convolutional layers stacked VGG-style [27]. These architectures are shown in Table 1. All of them have convolutional layers with 3×3 kernels in addition to 2×2 max-pooling layers. In order to avoid overfitting, weight decay of 1e-4 is applied to the convolutional layers. Additionally, dropout with ratio 0.50 is applied to the FC layers preceding the output layer. All layers use ReLU as the activation function except the output layer. The loss function is categorical cross entropy.
Secondly, convolutional autoencoders (CAEs) are preferred for the unsupervised learning task. In general, a CAE comprises two parts, called the encoder and the decoder. While the encoder generates a representation of the input data, the decoder takes this representation as input and reconstructs it in the output. When deciding on the architecture of the autoencoder, the size of the representation is very important, because as the size of the representation becomes smaller, the reconstruction of the images becomes harder. Therefore, two architectures are used for the unsupervised learning experiments, as described in Table 2. The encoder and decoder have symmetric convolutional-deconvolutional layers; the difference is that upsampling layers are used in the decoder instead of max-pooling layers. The first architecture, CAE-1, reduces the size of the representation to one quarter while the second one, CAE-2, reduces it to one half. For example, the representation of the 64×64 images of LFW is 1024-dimensional in CAE-1 and 2048-dimensional in CAE-2. For the unsupervised learning experiments, the reconstruction loss is mean squared error and the activation function is tanh in the output layer. The rest of the layers use ReLU.
Table 1. CNN architectures used in supervised learning experiments.

CNN-1: Conv-32, MaxPool | Conv-64, MaxPool | Conv-128, MaxPool | FC-128, FC-softmax
CNN-2: Conv-32, MaxPool | Conv-64, MaxPool | Conv-128, Conv-128, Conv-128, MaxPool | FC-128, FC-256, FC-softmax
CNN-3: Conv-32, Conv-32, MaxPool | Conv-64, Conv-64, MaxPool | Conv-128, Conv-128, Conv-128, MaxPool | FC-128, FC-256, FC-softmax
Table 2. CAE architectures used in unsupervised learning experiments.

CAE-1: Conv-32, MaxPool, Conv-4, MaxPool | Conv-4, UpSample, Conv-32, UpSample, Conv-tanh
CAE-2: Conv-32, MaxPool, Conv-8, MaxPool | Conv-8, UpSample, Conv-32, UpSample, Conv-tanh
In all experiments, the learning rate is 0.01 for SGD and its momentum variants, and 0.001 for the algorithms with adaptive learning rates. The same learning rate is used for all adaptive algorithms; SGD and its momentum variants use the larger value because they do not perform well with a smaller learning rate. In other words, the hyperparameters with which the algorithms give the best results on test data are preferred. The momentum α is 0.9 for the SGD variants. The decay constant ρ is 0.95 for AdaDelta. Also, the decay constants ρ_1 and ρ_2 are 0.9 and 0.999 for Adam, AdaMax, Nadam and AMSGrad. The minibatch size is 128 for all models and each of them is trained for 50 epochs. In order to compare the different algorithms, the same parameter initialization is used. The software is based on Theano [2] and runs on a single GTX 1050 Ti GPU.
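For convenience, the settings listed above can be gathered in one place. The dictionary below merely restates the values given in this section; the key names are illustrative and it is not configuration code from the paper.

```python
# Settings as reported in Section 4.2 (illustrative names, values from the text).
optimizer_settings = {
    "SGD":          {"lr": 0.01},
    "SGD-momentum": {"lr": 0.01, "alpha": 0.9},
    "SGD-Nesterov": {"lr": 0.01, "alpha": 0.9},
    "AdaGrad":      {"lr": 0.001},
    "AdaDelta":     {"lr": 0.001, "rho": 0.95},
    "RMSProp":      {"lr": 0.001},
    "Adam":         {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "AdaMax":       {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "Nadam":        {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
    "AMSGrad":      {"lr": 0.001, "rho1": 0.9, "rho2": 0.999},
}
training_settings = {"minibatch_size": 128, "epochs": 50,
                     "conv_weight_decay": 1e-4, "fc_dropout": 0.5}
```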
4.3. Supervised learning experiments
In order to compare the optimization algorithms, all datasets are classified with each CNN architecture. The classification results for the basic dataset MNIST are given in Table 3. For all architectures, AdaDelta and AdaMax give the best accuracies on test data. Additionally, SGD with Nesterov momentum follows AdaDelta and AdaMax very closely for CNN-3. The worst performance belongs to AdaGrad, the simplest adaptive learning algorithm. The behaviour of the algorithms during training is shown in Figure 1. SGD begins training with the highest loss for all three architectures while Nadam begins with the lowest. Therefore, it can be said that Nadam makes a better start to training under the same parameter initialization. Adam and its variants give similar results to each other for CNN-1; the difference between their performances appears when the network becomes deeper. Besides, AdaDelta performs better than AdaGrad and RMSProp during training.
Table 3. Comparison of the algorithms on MNIST for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  0.0468     99.00   |  0.0724     99.03   |  0.0780     99.24
SGD - momentum   |  0.0395     99.26   |  0.0633     99.18   |  0.0718     99.26
SGD - Nesterov   |  0.0366     99.34   |  0.0511     99.35   |  0.0589     99.41
AdaGrad          |  0.0600     98.57   |  0.0753     98.79   |  0.0730     99.08
AdaDelta         |  0.0306     99.50   |  0.0395     99.49   |  0.0460     99.40
RMSProp          |  0.0505     99.26   |  0.1899     98.70   |  0.1108     99.32
Adam             |  0.0425     99.35   |  0.0597     99.00   |  0.0536     99.29
AdaMax           |  0.0337     99.37   |  0.0418     99.51   |  0.0454     99.42
Nadam            |  0.0364     99.32   |  0.0567     99.19   |  0.0565     99.16
AMSGrad          |  0.0401     99.24   |  0.0561     99.28   |  0.0497     99.36
Fig. 1. The behaviour of algorithms on MNIST during training for three CNN architectures.
The classification results for CIFAR-10 are given in Table 4. As CIFAR-10 includes color images at higher resolution than MNIST, the performance of the algorithms begins to change. AdaMax gives the best accuracies on test data for all architectures. AMSGrad performs close to AdaMax for CNN-1 and CNN-2. On the other hand, Adam and SGD with Nesterov momentum follow AdaMax when the neural network becomes deeper. Similar to the MNIST results, AdaGrad cannot perform well on CIFAR-10. The behaviour of the algorithms during training is shown in Figure 2. While SGD begins training with the highest loss, Adam begins with the lowest. Also, AdaMax seems to perform better for deeper architectures.
Table 4. Comparison of the algorithms on CIFAR-10 for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  0.9428     67.98   |  0.8899     70.84   |  1.0112     68.19
SGD - momentum   |  1.2479     74.96   |  1.3368     76.74   |  1.1978     78.21
SGD - Nesterov   |  1.2235     76.01   |  1.2909     77.96   |  1.1651     79.97
AdaGrad          |  1.2428     56.62   |  1.1769     59.61   |  1.1531     60.55
AdaDelta         |  1.6619     74.08   |  1.7222     76.23   |  1.4789     78.55
RMSProp          |  1.2670     75.01   |  1.1354     76.14   |  0.9821     74.81
Adam             |  1.2279     76.05   |  1.1988     76.94   |  0.9393     79.03
AdaMax           |  0.8054     76.89   |  1.1240     79.17   |  1.0437     81.41
Nadam            |  1.2735     76.11   |  1.3102     76.60   |  1.0672     78.98
AMSGrad          |  1.1940     76.45   |  1.2000     77.11   |  1.0135     78.27
Fig. 2. The behaviour of algorithms on CIFAR-10 during training for three CNN architectures.
So far, the algorithms perform well except for SGD and AdaGrad, and the change of depth in the neural network architectures does not significantly affect the performance ranking of the algorithms. Therefore, the algorithms are next compared on a more complicated dataset than MNIST and CIFAR-10. LFW consists of face images, each of which is a 64×64 color image belonging to one of 34 people. It includes more classes than MNIST and CIFAR-10 as well as higher resolutions. Also, it has a small number of samples per class. This classification task is more difficult for the optimization algorithms. The classification results for LFW are given in Table 5. The most obvious result is the decrease in SGD performance: it learns almost nothing during training, and accordingly its accuracy on test data is the lowest. Similarly, AdaGrad also cannot perform well. The best accuracies on test data are obtained by Adam and RMSProp for CNN-1, AdaDelta for CNN-2 and Adam for CNN-3. In the previous experiments, the SGD momentum variants compete with the adaptive algorithms while training the neural networks on MNIST and CIFAR-10. However, while training the neural networks on LFW, the difference between them appears and the superiority of the adaptive learning algorithms becomes clear. Especially AdaDelta, Adam and its variants perform much better.
Table 5. Comparison of the algorithms on LFW for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  2.8016     22.76   |  3.1141     19.22   |  3.1152     19.22
SGD - momentum   |  1.0460     74.70   |  1.1003     71.66   |  1.0803     73.18
SGD - Nesterov   |  0.9470     78.24   |  0.8571     79.25   |  0.9930     75.37
AdaGrad          |  1.6427     58.17   |  1.8106     52.44   |  1.7973     52.95
AdaDelta         |  0.8116     82.29   |  0.6383     84.31   |  0.8417     80.10
RMSProp          |  0.7532     84.48   |  0.7469     81.78   |  0.9060     75.54
Adam             |  0.6887     84.48   |  0.8219     79.25   |  0.7223     82.96
AdaMax           |  0.8734     77.74   |  0.9822     74.70   |  0.8617     78.58
Nadam            |  0.7637     83.81   |  0.7172     82.96   |  0.8015     80.26
AMSGrad          |  0.6795     82.63   |  0.7648     82.12   |  0.6726     81.78
Fig. 3. The behaviour of algorithms on LFW during training for three CNN architectures.
The behaviour of the algorithms during training is shown in Figure 3. While SGD begins training with the highest loss for CNN-1, RMSProp begins with the highest for CNN-2 and CNN-3. On the other hand, AdaDelta begins training with a high training loss but ends up with one of the lowest, giving a good accuracy on test data. As the network becomes deeper, RMSProp becomes worse in both training loss and test accuracy. Also, RMSProp sometimes has sudden peaks during training; this behaviour is more obvious for the deeper networks. While the training losses of Adam and AMSGrad are very close to each other, AdaMax falls behind its variants. In general, AdaDelta, Adam, Nadam and AMSGrad perform significantly better during training. This result can be easily seen especially when the depth of the neural network increases.
Lastly, the performance of the algorithms is compared on Kaggle Flowers, which contains 96×96 images. The classification results for Kaggle Flowers are reported in Table 6 and the training process is shown in Figure 4. AdaMax and AMSGrad come into prominence according to the accuracies on test data. On the contrary, SGD gives the worst results. RMSProp begins the training with the highest loss; however, the initial losses become closer for all algorithms when the neural network becomes deeper. Also, the performance of Nadam decreases when the depth increases. This behaviour of Nadam may arise because of the high resolution of the images. AdaDelta, Nadam and RMSProp have sudden peaks during training. On the other hand, AdaDelta is not so successful on test data even though it performs well during training. Interestingly, AdaGrad performs well, especially for CNN-3: its accuracy on test data is the closest result to the best, which is achieved by AdaMax. Therefore, AdaGrad seems to perform well for deeper networks as well as input images with high resolution.
Table 6. Comparison of the algorithms on Kaggle Flowers for classification.

                 |        CNN-1        |        CNN-2        |        CNN-3
Algorithm        | Test Loss Test Acc. | Test Loss Test Acc. | Test Loss Test Acc.
SGD              |  1.0115     59.94   |  1.0764     59.66   |  1.3027     44.40
SGD - momentum   |  1.5339     67.99   |  1.3234     69.19   |  1.5486     63.64
SGD - Nesterov   |  1.5711     68.54   |  1.4873     68.54   |  1.7025     64.56
AdaGrad          |  0.8664     67.25   |  0.8639     66.60   |  0.9359     66.88
AdaDelta         |  1.7850     68.64   |  1.8371     65.58   |  2.1974     61.79
RMSProp          |  1.9709     67.99   |  1.9051     66.88   |  1.9653     63.55
Adam             |  1.9147     69.93   |  1.4274     68.82   |  1.8564     62.16
AdaMax           |  1.2073     69.93   |  1.1429     71.32   |  1.0537     70.30
Nadam            |  1.7687     67.06   |  1.9668     67.06   |  1.4252     64.56
AMSGrad          |  1.8220     70.67   |  1.6430     71.23   |  1.9611     61.88
In the supervised learning experiments, the positive effect of momentum on SGD is conspicuous, especially for the CIFAR-10, LFW and Kaggle Flowers datasets. Also, Nesterov momentum is slightly better, even though the momentum and Nesterov momentum results are close at the end of training. In general, Adam, AdaMax and AMSGrad perform well on test data in addition to AdaDelta. Also, the most important advantage of AdaDelta is that it does not require selecting the learning rate manually.
Fig. 4. The behaviour of algorithms on Kaggle Flowers during training for three CNN architectures.
Additionally, the algorithms are compared based on the training time they require. As LFW includes a small number of training examples in the supervised learning experiments, there is no significant difference between the training times of the algorithms for each CNN architecture. On the other hand, MNIST and CIFAR-10 have more training examples than the other two datasets; therefore, the training time varies across the algorithms, as shown in Figure 5. For CNN-1, RMSProp is the fastest among the adaptive algorithms for both datasets. While AdaDelta and AdaMax are slow for MNIST, AMSGrad and Nadam require more time for CIFAR-10. When the dataset becomes more complex, Adam needs more time. AdaGrad is fast but cannot generalize well on test data. Also, SGD and its momentum variants seem to be fast, but their learning rate is larger and they have fewer computational steps; nevertheless, they cannot generalize well, similar to AdaGrad.
RMSProp, Nadam and AdaMax require more time on CIFAR-10 for CNN-2. However, the complexity of the dataset does not matter for Adam and AMSGrad. When the neural network becomes deeper, in CNN-3, Nadam and AdaDelta take more time in comparison with the others for both datasets. In order to train CNN-1 and CNN-2, the difference between the slowest and fastest algorithms is less than one minute; for CNN-3, this difference increases to nearly 5 minutes. AdaMax takes the minimum time among the Adam variants for the deepest CNN. Additionally, the training times differ when CNN-3 is trained on Kaggle Flowers. As seen in Figure 5, Nadam and AdaDelta take more time, while RMSProp requires less time on Kaggle Flowers in comparison with the other adaptive algorithms.
Fig. 5. (Top) MNIST and CIFAR-10 training time for CNN-1 and CNN-2. (Bottom left) MNIST training time for CNN-3. (Bottom right) Kaggle Flowers training time for CNN-3.
4.4. Unsupervised learning experiments

When a CAE is used, the task is to reconstruct the input images in the output layer. When a denoising CAE is used, noisy images are taken as input and the task is to reconstruct clean images in the output layer. Therefore, a Gaussian noise matrix is applied to the input images before using the denoising CAE; in these experiments, the noise factor is 0.25. Both vanilla and denoising autoencoder results on MNIST are given in Table 7. While Adam gives the lowest reconstruction loss for the CAE-1 architecture, AdaMax and Nadam give the lowest for CAE-2, although the performances of Adam and its variants are very close to each other. On the other hand, Nadam gives the best results for the denoising CAE. RMSProp performs well on the denoising task in addition to the Adam variants. The performance of AdaMax slightly decreases for the denoising task. Besides, AdaGrad falls behind the other adaptive algorithms in both tasks. The training loss for MNIST is shown in Figure 6; it can be seen that Adam and AMSGrad begin training with the lowest loss.
The results on CIFAR-10, LFW and Kaggle Flowers are similar: Adam and its variants give the best results. Different from the MNIST results, RMSProp begins training with the lowest reconstruction loss; however, Adam and its variants give better results at the end of training. The results on Kaggle Flowers are interesting. For CAE-1, the performance of Adam and AMSGrad gets worse in the denoising task. Accordingly, it can be said that when the size of the representation is small, Adam and AMSGrad cannot perform well on high-resolution images. Also, RMSProp performs well together with Adam and its variants on Kaggle Flowers for both unsupervised tasks. The vanilla and denoising autoencoder results are given in Table 8 for CIFAR-10, Table 9 for LFW and Table 10 for Kaggle Flowers, respectively. Also, the behaviour of the optimization algorithms during training for the unsupervised learning tasks is shown in Figure 7 for CIFAR-10, Figure 8 for LFW and Figure 9 for Kaggle Flowers.
Table 7. Comparison of the algorithms on MNIST for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0175      0.0160   |  0.0219      0.0186
SGD - momentum   |  0.0095      0.0074   |  0.0120      0.0103
SGD - Nesterov   |  0.0094      0.0074   |  0.0120      0.0102
AdaGrad          |  0.0113      0.0084   |  0.0140      0.0125
AdaDelta         |  0.0080      0.0053   |  0.0105      0.0077
RMSProp          |  0.0065      0.0044   |  0.0087      0.0067
Adam             |  0.0053      0.0038   |  0.0082      0.0067
AdaMax           |  0.0055      0.0036   |  0.0088      0.0069
Nadam            |  0.0056      0.0036   |  0.0080      0.0066
AMSGrad          |  0.0054      0.0038   |  0.0083      0.0068
Fig. 6. The behaviour of algorithms on MNIST during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Table 8. Comparison of the algorithms on CIFAR-10 for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0145      0.0152   |  0.0189      0.0158
SGD - momentum   |  0.0093      0.0091   |  0.0119      0.0108
SGD - Nesterov   |  0.0093      0.0089   |  0.0119      0.0109
AdaGrad          |  0.0115      0.0067   |  0.0167      0.0113
AdaDelta         |  0.0093      0.0088   |  0.0113      0.0094
RMSProp          |  0.0072      0.0059   |  0.0102      0.0083
Adam             |  0.0055      0.0047   |  0.0096      0.0075
AdaMax           |  0.0061      0.0047   |  0.0087      0.0074
Nadam            |  0.0069      0.0046   |  0.0087      0.0073
AMSGrad          |  0.0060      0.0049   |  0.0098      0.0075

Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 7. The behaviour of algorithms on CIFAR-10 during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
SGD is by far the worst for both unsupervised tasks on all datasets. In general, Adam and its variants seem to be much better than the other adaptive gradient methods, as well as than SGD and its momentum variants. AdaGrad cannot perform well among the adaptive algorithms.
Table 9. Comparison of the algorithms on LFW for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0122      0.0136   |  0.0125      0.0094
SGD - momentum   |  0.0070      0.0077   |  0.0073      0.0063
SGD - Nesterov   |  0.0070      0.0077   |  0.0071      0.0063
AdaGrad          |  0.0071      0.0046   |  0.0090      0.0069
AdaDelta         |  0.0061      0.0055   |  0.0086      0.0066
RMSProp          |  0.0044      0.0033   |  0.0062      0.0053
Adam             |  0.0030      0.0018   |  0.0044      0.0039
AdaMax           |  0.0033      0.0019   |  0.0052      0.0043
Nadam            |  0.0036      0.0025   |  0.0050      0.0045
AMSGrad          |  0.0030      0.0018   |  0.0049      0.0041

Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 8. The behaviour of algorithms on LFW during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
In order to compare the results visually, the reconstructions of test images obtained by all algorithms for the CAE-1 architecture are shown in Figure 10. As handwritten digit images are simpler than the other datasets, all algorithms perform well on MNIST; even though the reconstructions of SGD are a little blurry, the digits can be seen clearly. An important point concerns the Kaggle Flowers results: SGD and AdaMax cannot reconstruct the colors properly. Similar results can be seen for CAE-2, as shown in Figure 11. Here, it seems that all adaptive algorithms can reconstruct images with their colors when the size of the representation increases.
Table 10. Comparison of the algorithms on Kaggle Flowers for reconstruction.

                 |          CAE          |     Denoising CAE
Algorithm        | CAE-1 Loss  CAE-2 Loss| CAE-1 Loss  CAE-2 Loss
SGD              |  0.0359      0.0385   |  0.0367      0.0347
SGD - momentum   |  0.0217      0.0220   |  0.0320      0.0217
SGD - Nesterov   |  0.0217      0.0214   |  0.0230      0.0219
AdaGrad          |  0.0298      0.0179   |  0.0254      0.0234
AdaDelta         |  0.0220      0.0180   |  0.0236      0.0203
RMSProp          |  0.0202      0.0159   |  0.0216      0.0179
Adam             |  0.0144      0.0121   |  0.0269      0.0148
AdaMax           |  0.0205      0.0133   |  0.0209      0.0164
Nadam            |  0.0170      0.0138   |  0.0208      0.0166
AMSGrad          |  0.0145      0.0122   |  0.0270      0.0146

Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
Fig. 9. The behaviour of algorithms on Kaggle Flowers during training for two autoencoder architectures. (Top) Training loss for CAE. (Bottom) Training loss for denoising CAE.
For the denoising task, the reconstructions for all datasets with CAE-1 are given in Figure 12. The small size of the representation makes this task more difficult. None of the algorithms can reconstruct the colors properly on Kaggle Flowers. Also, as it is hard to reconstruct faces and objects, the results are blurry. However, as the representation size increases, the colors given by especially Adam and its variants are much better, as shown in Figure 13. Also, while SGD and AdaGrad can reconstruct images with their colors on CIFAR-10 for CAE-1, both algorithms fail with the denoising CAE-1.
Fig. 10. Reconstructions of test images using CAE-1.
Fig. 11. Reconstructions of test images using CAE-2.
Fig. 12. Reconstructions of test images using denoising CAE-1.
Fig. 13. Reconstructions of test images using denoising CAE-2.
Lastly, the algorithms are compared based on the training time they require. Similar to the supervised learning experiments, training time does not vary significantly on LFW and Kaggle Flowers. The training times on MNIST and CIFAR-10 for the unsupervised learning experiments are shown in Figure 14. In general, Adam and Nadam need more time than the other algorithms to reconstruct images and, similar to the supervised results, they need more time when the dataset becomes more complex. Also, AdaMax requires more time on CIFAR-10, as especially seen for CAE-1. All algorithms need more time for CAE-2 relative to CAE-1. Also, the denoising task mostly causes the training time to decrease on CIFAR-10 and increase on MNIST. Even though SGD and its momentum variants train the autoencoders quickly at the learning rate where they perform best, they cannot generalize well on test data.
Fig. 14. (Top) MNIST and CIFAR-10 training time for CAE-1 and CAE-2. (Bottom) MNIST and CIFAR-10 training time for denoising CAE-1 and CAE-2.
5. Conclusion

In this study, the most commonly used optimization algorithms in deep learning are examined. The differences between their working principles are summarized, considering the pros and cons of each. In this context, the importance of adaptive learning algorithms is highlighted. The performance of the algorithms is compared empirically on four image datasets, and the behaviour of the algorithms during training is observed with respect to the effects of different image resolutions and neural network architectures. Because the adaptive methods are mostly superior and computationally efficient, their results are better for both tasks. Still, the research to find better adaptive methods for deep learning continues.
References
1. E. Alpaydın, Introduction to Machine Learning, The MIT Press, 2014.
2. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley and Y. Bengio, Theano: A CPU and GPU math compiler in Python, in Proc. 9th Python in Science Conference, Austin, Texas, 2010, pp. 3-10.
3. L. Bottou, F.E. Curtis and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60(2) (2018) 223-311.
4. Y. Dauphin, H. De Vries, J. Chung and Y. Bengio, RMSProp and equilibrated adaptive learning rates for non-convex optimization, CoRR abs/1502.04390, 2015.
5. T. Dozat, Incorporating Nesterov momentum into Adam, International Conference on Learning Representations, San Juan, Puerto Rico, 1 (2016) 2013-2016.
6. J. Duchi, E. Hazan and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12 (2011) 2121-2159.
7. A. Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, O'Reilly, 2017.
8. I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, The MIT Press, 2016.
9. H. Hayashi, J. Koushik and G. Neubig, Eve: A gradient based optimization method with locally and globally adaptive learning rates, arXiv preprint arXiv:1611.01505, 2016.
10. G. Hinton, Neural networks for machine learning, Coursera, video lectures, 2012.
11. G.B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
12. H. Huang, C. Wang and B. Dong, Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557, 2018.
13. D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
14. A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009.
15. A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 25 (2012) 1097-1105.
16. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow and A.Y. Ng, On optimization methods for deep learning, in Proc. 28th International Conference on Machine Learning, Bellevue, Washington, USA, 2011, pp. 265-272.
17. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11) (1998) 2278-2324.
18. L. Luo, Y. Xiong, Y. Liu and X. Sun, Adaptive gradient methods with dynamic bound of learning rate, International Conference on Learning Representations, New Orleans, 2019.
19. A. Mamaev, Flowers Recognition, https://www.kaggle.com/alxmamaev/flowers-recognition, 2018.
20. M.C. Mukkamala and M. Hein, Variants of RMSProp and Adagrad with logarithmic regret bounds, in Proc. 34th International Conference on Machine Learning, Sydney, Australia, 70 (2017) 2545-2553.
21. Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, 27 (1983) 372-376.
22. B.T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4(5) (1964) 1-17.
23. S.J. Reddi, S. Kale and S. Kumar, On the convergence of Adam and beyond, International Conference on Learning Representations, Vancouver, Canada, 2018.
24. H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22 (1951) 400-407.
25. S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.
26. N. Shazeer and M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, arXiv preprint arXiv:1804.04235, 2018.
27. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations, San Diego, CA, USA, 2015.
28. D. Soham, M. Anirbit and U. Enayat, Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration, arXiv preprint arXiv:1807.06766, 2018.
29. I. Sutskever, J. Martens, G. Dahl and G. Hinton, On the importance of initialization and momentum in deep learning, International Conference on Machine Learning, Atlanta, USA, 2013, pp. 1139-1147.
30. M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar, Adaptive methods for nonconvex optimization, Advances in Neural Information Processing Systems, 31 (2018) 9793-9803.
31. M.D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv preprint arXiv:1212.5701, 2012.
Derya Soydaner received her B.Sc., M.Sc. and Ph.D. degrees in statistics from Mimar Sinan Fine Arts University, Istanbul, Turkey in 2012, 2014 and 2018, respectively. She mainly studied neural networks throughout her postgraduate education. Currently, she is a faculty member in the Department of Statistics at Mimar Sinan Fine Arts University. Her research interests include pattern recognition, machine learning and image processing.