Gradient Descent is an iterative optimization algorithm used to find a local minimum (or maximum) of a given function. It is used very frequently in Machine Learning, where it minimizes the cost/loss function in order to fit an efficient model.
For gradient descent, the function must be:
- differentiable: it must have a derivative at every point
- convex: a line segment connecting any two points on the function's curve must lie on or above the curve and never cross it (https://galacodes.tech.blog/wp-content/uploads/2022/05/8dcff-1ofzseguvgdg15jgt2i__9g.png)
In gradient descent, the algorithm iteratively calculates the next point by taking the gradient at the current point, scaling it by the learning rate, and subtracting that value from the current point to make the step.
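As a minimal sketch of a single update step (the function f(x) = x**2 and the learning rate value below are assumptions chosen for illustration):

```python
# One gradient descent step on f(x) = x**2, whose gradient is 2*x.
def gradient(x):
    return 2 * x

current_point = 5.0
learning_rate = 0.1  # assumed value for illustration

# Scale the gradient by the learning rate and step in the opposite direction.
next_point = current_point - learning_rate * gradient(current_point)
print(next_point)  # 4.0 -- one step closer to the minimum at x = 0
```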
The learning rate is very important in this stage:
- Higher Learning Rate: if the learning rate is too big, the algorithm may overshoot and miss the local minimum completely.
- Lower Learning Rate: if the learning rate is too small, it may take forever for the algorithm to converge to a local minimum.
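To see both failure modes concretely, here is a small sketch that minimizes the same quadratic with three learning rates; the specific values (1.05, 0.001, 0.1) are assumptions chosen to exhibit each behavior:

```python
# Run 50 gradient descent steps on f(x) = x**2 with different learning rates.
def step(x, lr):
    return x - lr * 2 * x  # gradient of x**2 is 2*x

for lr in (1.05, 0.001, 0.1):
    x = 5.0
    for _ in range(50):
        x = step(x, lr)
    print(f"lr={lr}: x after 50 steps = {x:.6f}")
# lr=1.05 overshoots further on every step and diverges,
# lr=0.001 barely moves from the start point,
# lr=0.1 converges close to the minimum at x = 0.
```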
Overview of Gradient Descent Algorithm:
- choose a starting point (initialization)
- calculate the gradient at this point
- make a scaled step in the opposite direction to the gradient (objective: minimize)
- repeat steps 2 and 3 until one of the stopping criteria is met:
  - the maximum number of iterations is reached
  - the step size is smaller than the tolerance
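Following these steps, here is a minimal sketch in Python; the example function, starting point, and hyper-parameter defaults are assumptions for illustration:

```python
# Minimal gradient descent loop implementing the steps above.
def gradient_descent(gradient, start, learning_rate=0.1,
                     max_iterations=1000, tolerance=1e-6):
    point = start                               # 1. choose a starting point
    for _ in range(max_iterations):             # stop: max iterations reached
        step = learning_rate * gradient(point)  # 2. gradient, scaled by LR
        if abs(step) < tolerance:               # stop: step below tolerance
            break
        point -= step                           # 3. step against the gradient
    return point

# Example: minimize f(x) = (x - 3)**2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(minimum)  # ~3.0
```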
Types of Gradient Descent Algorithms:
In gradient descent, we usually take the entire data set as one batch and iteratively try to reduce the loss. This gives the most accurate gradient estimate, but if the data set is huge (millions or billions of examples), even a single iteration can take a very long time. Thus it is not always practical to use Batch Gradient Descent (BGD).
- Stochastic Gradient Descent – in SGD we use only a single example from the data set in each iteration. This may not be the best way to reduce the loss, but it still works in some cases. The term "stochastic" indicates that the one example making up each batch is chosen at random.
- Mini-batch Gradient Descent – a compromise between full-batch GD and stochastic GD. In each iteration we select a small batch of roughly 10 to 1,000 examples at random and run the algorithm on that batch. This has lower noise (error) than SGD and is much more efficient than full-batch GD.
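As a sketch of how the three variants differ, here is the gradient computation for linear regression with squared loss; the data, model, and mini-batch size of 64 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))                  # features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10_000)
w = np.zeros(3)                                   # model weights

def grad(X_batch, y_batch, w):
    # Gradient of mean squared error for a linear model on the given batch.
    residual = X_batch @ w - y_batch
    return 2 * X_batch.T @ residual / len(y_batch)

g_batch = grad(X, y, w)                           # BGD: the entire data set

i = rng.integers(len(y))                          # SGD: one random example
g_sgd = grad(X[i:i + 1], y[i:i + 1], w)

idx = rng.choice(len(y), size=64, replace=False)  # mini-batch: 64 random examples
g_mini = grad(X[idx], y[idx], w)
```

The larger the batch, the lower the variance of the gradient estimate, at the cost of more computation per step.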
Why does this work?
A large data set often contains redundant (duplicate) data; in fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but beyond a point extra examples add little predictive value. So a small batch that is representative of the entire data set (since it is selected at random) helps in tackling redundancy.