Dynamic learning rates
The learning rate determines how large a step SGD takes at each update. By default the learning rate is usually held constant. During training, a constant learning rate can cause a number of issues:
- If the learning rate is too small, the optimisation will need a large number of iterations (taking a long time and potentially never reaching the optimum).
- If the learning rate is too big, the optimisation may be unstable (bouncing around the optimum, and maybe even getting worse rather than better).
- The optimisation may get stuck in an unsatisfactory local minimum, or in other challenging areas like a saddle point[^saddle_points].
The learning rate can be manually adjusted throughout the training process to improve performance, but there are also a number of approaches that adjust it dynamically.
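The effect of the step size can be seen on a toy problem. The sketch below is illustrative only: the function `gradient_descent`, the loss f(x) = x², and the three learning rates are assumptions chosen for the example, not values taken from any particular model.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x             # derivative of x**2
        x -= learning_rate * grad  # constant learning rate for every step
    return x

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # approaches the optimum at x = 0
too_big = gradient_descent(1.1)       # overshoots more each step and diverges

print(too_small, reasonable, too_big)
```

With the same number of steps, the smallest rate is still close to its starting point, the middle rate is close to zero, and the largest rate ends up far further from the optimum than where it began.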
The loss profile for DNNs often includes saddle points - areas where the gradient of the loss function becomes very small even though the point is not a minimum, making the gradient descent process much slower. Momentum is intended to help speed the optimisation process through cases like this.
Momentum works by taking the gradient calculated by SGD and adding a term to it. The added term can be thought of as the average of the previous gradients. Thus if the previous gradients were zigzagging through a saddle point, the opposing components cancel out and their average points along the valley of the saddle point. Adding this average direction helps the optimisation to proceed in the right direction.
The “average” of the previous gradients is often calculated as an exponentially weighted moving average (for example, v[n] = 0.9 × v[n-1] + 0.1 × dL[n], where v[n-1] is the previous value of the average and dL[n] is the current gradient).
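The averaging can be sketched in a few lines. This is a minimal illustration using the 0.9 / 0.1 weighting from the example above; the helper names `ema_update` and `momentum_step`, the learning rate, and the artificial zigzag gradients are assumptions made for the demonstration. Other formulations of momentum scale the terms differently.

```python
def ema_update(velocity, grads, beta=0.9):
    """Blend the new gradient into the running average of past gradients."""
    return [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grads)]

def momentum_step(params, velocity, learning_rate=0.1):
    """Move each parameter against the averaged gradient."""
    return [p - learning_rate * v for p, v in zip(params, velocity)]

params = [0.0, 0.0]
velocity = [0.0, 0.0]
for step in range(20):
    # Hypothetical gradients: steady in the first dimension,
    # flipping sign every step in the second.
    zigzag = 1.0 if step % 2 == 0 else -1.0
    grads = [1.0, zigzag]
    velocity = ema_update(velocity, grads)
    params = momentum_step(params, velocity)

print(velocity)
```

The first component of `velocity` builds up towards the steady gradient, while the second stays close to zero because the alternating signs largely cancel.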
Traditionally the learning rate is the same for all parameters in the model. Adagrad is a technique to adjust the learning rate for each individual parameter. If a parameter has had small gradients, Adagrad will barely modify the learning rate for that parameter. If a parameter has had large gradients, Adagrad will shrink the learning rate for that parameter.
The implementation looks at the gradients that were previously calculated for a parameter. It squares each of these gradients (which ignores the sign and only considers the magnitude), adds all of the squares together, and then takes the square root. For the next update, the learning rate for this parameter is the overall learning rate divided by this calculated value.
The main drawback is that the divisor can only ever increase, so the learning rate can only ever decrease. Because of this, training will reach a point where a given parameter can only be updated by a tiny amount, effectively meaning that the parameter can no longer learn any further.
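A minimal single-parameter sketch of this update is shown below. The function name `adagrad_step`, the toy loss `param**2`, and the default values are assumptions for illustration; the small `eps` term is also an addition, included only to avoid dividing by zero.

```python
import math

def adagrad_step(param, grad, sum_sq, learning_rate=0.1, eps=1e-8):
    """One Adagrad update for a single parameter."""
    sum_sq += grad * grad                      # running sum of squared gradients
    effective_lr = learning_rate / (math.sqrt(sum_sq) + eps)
    param -= effective_lr * grad
    return param, sum_sq, effective_lr

param, sum_sq = 5.0, 0.0
effective_lrs = []
for _ in range(100):
    grad = 2.0 * param                         # gradient of the toy loss param**2
    param, sum_sq, lr_t = adagrad_step(param, grad, sum_sq)
    effective_lrs.append(lr_t)

print(effective_lrs[0], effective_lrs[-1])
```

Because `sum_sq` never decreases, the recorded effective learning rate never increases from one step to the next, which is the limitation described above.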
RMSProp is very similar to Adagrad, with the aim of resolving Adagrad’s primary limitation. Adagrad will continually shrink the learning rate for a given parameter (effectively stopping training on that parameter eventually). RMSProp is able to shrink or increase the learning rate.
Like Adagrad, RMSProp divides the overall learning rate by the square root of a quantity built from the squares of the previous gradients for a given parameter. The difference is that RMSProp doesn’t weight all of the previous gradients equally: instead of a plain sum, it uses an exponentially weighted moving average of the squared gradients. This means that older values contribute less than newer values, so the divisor can shrink again once recent gradients become small, and the learning rate for that parameter can recover rather than decaying towards zero.
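The difference from Adagrad can be sketched as follows. The function name `rmsprop_step`, the decay of 0.9, the `eps` term, and the artificial gradient sequence (large gradients followed by small ones) are all assumptions chosen to make the behaviour visible.

```python
import math

def rmsprop_step(param, grad, avg_sq, learning_rate=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update for a single parameter."""
    # Exponentially weighted moving average of squared gradients,
    # so older gradients fade out instead of accumulating forever.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    effective_lr = learning_rate / (math.sqrt(avg_sq) + eps)
    param -= effective_lr * grad
    return param, avg_sq, effective_lr

param, avg_sq = 0.0, 0.0
lrs = []
for grad in [10.0] * 20 + [0.1] * 50:   # hypothetical gradient history
    param, avg_sq, lr_t = rmsprop_step(param, grad, avg_sq)
    lrs.append(lr_t)

lr_after_large = lrs[19]    # effective rate while gradients are large
lr_after_small = lrs[-1]    # effective rate after gradients become small
print(lr_after_large, lr_after_small)
```

The effective learning rate is small while the gradients are large and grows again once they become small, which an ever-increasing sum of squares would not allow.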
Adam (Adaptive Moment Estimation) combines the benefits of momentum with the benefits of RMSProp. Momentum looks at the mean of the recent gradients for a parameter, and continues to adjust the parameter in that direction. RMSProp looks at the recent magnitude of the gradients (their uncentred variance), and shrinks the learning rate accordingly. Adam does both of these things - it multiplies the learning rate by the momentum term, and also divides by the square root of the moving average of the squared gradients.
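A single-parameter sketch of how the two pieces combine is given below. The function name `adam_step`, the toy loss `param**2`, and the learning rate are assumptions for illustration. The bias-correction step (dividing by `1 - beta ** t`) is not described in the text above; it is included here because it is part of the commonly used form of Adam, as are the default values `beta1=0.9` and `beta2=0.999`.

```python
import math

def adam_step(param, grad, m, v, t, learning_rate=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # momentum: average gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad    # RMSProp: average squared gradient
    m_hat = m / (1.0 - beta1 ** t)                 # correct the bias from starting at zero
    v_hat = v / (1.0 - beta2 ** t)
    param -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 5.0, 0.0, 0.0
history = []
for t in range(1, 501):
    grad = 2.0 * param                             # gradient of the toy loss param**2
    param, m, v = adam_step(param, grad, m, v, t)
    history.append(param)

print(history[0], history[-1])
```

On the first step the momentum term and the divisor have the same magnitude, so the parameter moves by roughly the learning rate regardless of how large the gradient is.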