Understanding Gradient Descent and Its Concepts
How well do you know these concepts?
Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function describes how well the model performs with its current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. In other words, we use gradient descent to update the parameters of our model; for example, the parameters are the coefficients in Linear Regression and the weights in a neural network.
In this article, we will look at some core concepts of gradient descent and the cost function.
What is the Cost Function?
The primary setup for learning a neural network is to define a cost function that measures how well the network predicts outputs on the training set. The goal is then to find the set of weights and biases that minimizes this cost. One commonly used cost function is the mean squared error, which measures the average squared difference between the actual value of y and the predicted value of y. The equation of the regression line below is hθ(x) = θ0 + θ1x, which has only two parameters: the weight (θ1) and the bias (θ0).
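To make this concrete, here is a minimal NumPy sketch of a mean squared error cost for the two-parameter hypothesis hθ(x) = θ0 + θ1x. The toy dataset and the 1/2 scaling factor (a common convention that simplifies the derivative) are illustrative assumptions, not values from this article.

```python
import numpy as np

def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def mse_cost(theta0, theta1, x, y):
    """Mean squared error cost over all training examples.
    The 1/2 factor is a common convention that simplifies the derivative."""
    errors = predict(theta0, theta1, x) - y
    return np.mean(errors ** 2) / 2.0

# Hypothetical toy dataset, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
print(mse_cost(0.0, 2.0, x, y))   # cost with theta0 = 0, theta1 = 2
```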
Minimizing the Cost Function
The goal of any Machine Learning model is to minimize the Cost Function.
Picture the cost surface as a landscape: our goal is to move from the high-cost region (the mountain in the top right corner) down to the low-cost region (the dark blue sea in the bottom left). To get the lowest error value, we need to adjust the parameters θ0 and θ1 until we reach the smallest possible error, because a lower error between the actual and predicted values means the algorithm has learned well. Gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of the cost function.
Calculating gradient descent
Gradient Descent runs iteratively, using calculus, to find the optimal values of the parameters corresponding to the minimum of the given cost function. Mathematically, the derivative is the key tool for minimizing the cost function because it points us toward the minimum. The derivative is a concept from calculus that refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) in which to move the coefficient values to get a lower cost on the next iteration.
The derivative of the cost function (in our case, J(θ)) with respect to each parameter (in our case, the weight θ) tells us how sensitive the function is to that variable, i.e., how changing the variable changes the function value. Gradient descent therefore enables the learning process to make corrective updates that move the model toward an optimal combination of parameters (θ). In this basic form of gradient descent, the cost is calculated over the entire training dataset for each iteration, and the set of samples used to compute the gradient in one iteration is called a batch.
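As an illustration, the sketch below computes the two partial derivatives of the MSE cost averaged over the whole training set, i.e. one batch; the function name and the NumPy-based setup are assumptions made for this example.

```python
import numpy as np

def batch_gradients(theta0, theta1, x, y):
    """Partial derivatives of the MSE cost with respect to theta0 and theta1,
    averaged over the entire training set (one batch)."""
    errors = (theta0 + theta1 * x) - y      # prediction error for every example
    d_theta0 = np.mean(errors)              # dJ/d(theta0)
    d_theta1 = np.mean(errors * x)          # dJ/d(theta1)
    return d_theta0, d_theta1
```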
The derivation step
It helps to have a basic understanding of calculus here, because partial derivatives and the chain rule are applied in this step.
To compute the gradient, we iterate through our data points using the current weight θ1 and bias θ0 values and compute the partial derivatives. This gradient tells us the slope of our cost function at our current position and the direction in which we should move to update our parameters. The size of our update is controlled by the learning rate.
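Putting these steps together, here is a minimal sketch of the full update loop under the same assumptions as the earlier snippets; the starting parameter values, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by repeatedly stepping
    against the gradient of the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y
        d_theta0 = np.mean(errors)            # slope of the cost w.r.t. theta0
        d_theta1 = np.mean(errors * x)        # slope of the cost w.r.t. theta1
        theta0 -= alpha * d_theta0            # step size controlled by the learning rate
        theta1 -= alpha * d_theta1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
print(gradient_descent(x, y))                 # theta1 should end up close to 2
```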
Learning rate (α)
The size of these steps is called the learning rate (α), which gives us some additional control over how large a step we take. With a large learning rate, we can cover more ground with each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient because we recalculate it so frequently; this is more precise, but the many small steps are time-consuming, so it will take us a very long time to get to the bottom. Commonly used rates are 0.001, 0.003, 0.01, 0.03, 0.1, and 0.3.
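A small experiment can illustrate this trade-off. The sketch below (same toy data as before, with hypothetical rates and step counts) reports the final cost for several learning rates; on data like this, the smallest rate barely makes progress in the allotted steps, while a rate that is too large overshoots and the cost grows instead of shrinking.

```python
import numpy as np

def final_cost(alpha, x, y, n_iters=100):
    """Run gradient descent with a given learning rate and return the final MSE cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y
        theta0 -= alpha * np.mean(errors)
        theta1 -= alpha * np.mean(errors * x)
    return np.mean(((theta0 + theta1 * x) - y) ** 2) / 2.0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
for alpha in (0.001, 0.01, 0.1, 0.3):
    print(f"alpha={alpha}: cost after 100 steps = {final_cost(alpha, x, y):.6f}")
```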
Now let’s discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity to perform each parameter’s update (learning step).
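As a rough preview of that difference, the hedged sketch below only shows how each variant might select the examples used for a single gradient computation; the dataset size and the mini-batch size of 32 are arbitrary illustrative choices, and the variants themselves are discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 1000

# Batch gradient descent: every example is used for each gradient computation.
batch_indices = np.arange(n_examples)

# Stochastic gradient descent: a single randomly chosen example per update.
sgd_index = rng.integers(n_examples)

# Mini-batch gradient descent: a small random subset (here 32 examples) per update.
mini_batch_indices = rng.choice(n_examples, size=32, replace=False)
```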