Stochastic Gradient Descent (SGD)
SGD is an optimisation technique - a tool used to update the parameters of a model. Most optimisation techniques (including SGD) are used in an iterative fashion: The first run adjusts the parameters a bit, and consecutive runs keep adjusting the parameters (hopefully improving them).
Why use SGD?
SGD is an optimisation technique. It is an alternative to Standard Gradient Descent and other approaches like batch training or BFGS. It still leads to fast convergence, with some advantages:
- Doesn't require storing all training data in memory (good for large training sets)
- Allows adding new data in an "online" setting
How is it different to other techniques?
Rather than evaluating a "cost function" over the entire training set (as in Standard Gradient Descent), SGD uses a subset of the training data (a minibatch). The subset that gets used should change each iteration, and it should be selected randomly each iteration.
The number of datapoints to use in the minibatch is not set in stone, the considerations are:
- Slow convergence (the parameters may jump around a lot, rather than smoothly approaching the optimum outcome).
- The model parameters and the minibatch data can be stored in matrices. There are some efficiency wins by doing smart matrix operations, so having a small minibatch means we aren't making the most of these smart matrix operations.
- The speed advantages get lost as more of the training data gets used.
Optimisation techniques often have a variable or parameter called a learning rate. The learning rate specifies how aggressively the optimisation technique should jump between each iteration.
- If the learning rate is too small, the optimisation will need to be run a lot of times (taking a long time and potentially never reaching the optimum).
- If the learning rate is too big then the optimisation may be unstable (bouncing around the optimum, and maybe even getting worse rather than better).
Because SGD uses a subset of all the training data, there will be more variance in the parameters for each iteration than batch optimisation techniques. Because of this the learning rate should be set to be smaller than the learning rate for batch techniques.
Dynamic Learning Rates
Initially a model's parameters may be far from their optimum values. That means a high learning rate can be used, to make big jumps in the right direction. As the parameters get closer to their optimum values, big jumps in a parameter's value may actually jump past the optimum. In this case the optimum parameter values may never be reached. One solution to this is to change what the learning rate is as the parameters are optimised. There are a range of ways this can be done, here are a couple of examples:
- Use a large learning rate for the first couple of iterations, and then shrink the learning rate for each consecutive iteration.
- Use a held-out set to assess learning:
- Hold a set of the training data out of the optimisation process.
- Once an iteration is complete, run the held-out data through the model and look at the performance of the model.
- If the performance improvement between iterations has dropped below a certain threshold, shrink the learning rate.
- "backtracking line search"???
See dynamic learning rates for more info.
As parameters approach a local optimum, the improvements can slow down, taking a long time to finally get to the local optimum. Adding "momentum" to the optimisation technique can help to keep improving the model parameters towards the end of the optimisation process. Momentum will look at how the parameters were changing over the last few iterations, and use that information to keep moving in the same direction. The number of prior iterations to take into account changes as the initial learning stabilises.