## What is Gradient Descent?

`Gradient Descent (GD)` is an optimization algorithm that uses the `gradient` to find a local minimum of a function. (For more details on gradients, please refer to my other tutorial.) When the function is convex, a local minimum is also the global minimum.

## Steps of Gradient Descent Algorithm

**Step 1:** Calculate the gradient function. In simple terms, it is the vector of partial derivatives of a given function.

\[ \nabla f(p)=\left[ \begin{array}{c} \frac{\partial f}{\partial x}(p) \\ \frac{\partial f}{\partial y}(p) \\ \vdots \\ \frac{\partial f}{\partial w}(p) \end{array} \right] \]

**Step 2:** Determine the learning rate. The learning rate is multiplied by the gradient to determine the step size in each iteration.

\[ \text{Learning Rate}: \gamma \]

\[ \text{Step Size}: \gamma \nabla f(p_n) \]

**Step 3:** Calculate the next point, which is the current point minus the step size. Note that, for the very first step, *p₀* needs to be guessed and provided as part of the input.

\[ p_{1}= p_0 - \gamma \nabla f(p_0) \]

\[ p_{2}= p_1 - \gamma \nabla f(p_1) \]

\[ \dots \]

\[ p_{n+1}= p_n - \gamma \nabla f(p_n) \]

**Step 4:** Keep iterating until the absolute value of the step size is smaller than a preset tolerance value (*tol*).

\[ \left| \gamma \nabla f(p_n) \right| < tol \]
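
The four steps above can be sketched as a short, generic Python function (a minimal sketch for the single-variable case; the name `gradient_descent` and the callable `grad_f` are my own, supplied by the caller rather than taken from the post):

```
def gradient_descent(grad_f, p0, learning_rate, max_itr, tol):
    """Follow Steps 1-4 for a single-variable function."""
    p = p0
    for _ in range(max_itr):
        step_size = learning_rate * grad_f(p)  # Step 2: step size
        if abs(step_size) < tol:               # Step 4: stop when the step is tiny
            break
        p = p - step_size                      # Step 3: move against the gradient
    return p

# Example: minimize y = x^2 + 6x + 10, whose gradient is 2x + 6
minimum = gradient_descent(lambda x: 2*x + 6, p0=7.0,
                           learning_rate=0.1, max_itr=1000, tol=0.01)
print(minimum)  # close to -3, the true minimizer
```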

## Example of Gradient Descent

For instance, consider a function with a single independent variable.

\[ \ y=x^2+6x+10 \]

**Step 1:** We calculate the gradient of the function.

\[ \ \frac{dy}{dx}=2x+6 \]

The gradient can also be written as follows.

\[ \nabla f(p)=\left[ \begin{array} {} 2p+6 \end{array} \right]\]
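
As a quick sanity check (this snippet is my own addition, not part of the original walkthrough), the analytic gradient `2p + 6` can be compared with a numerical finite-difference estimate:

```
def function_1(x):
    return x**2 + 6*x + 10

p = 7.0
analytic = 2*p + 6
# central finite-difference approximation of dy/dx at p
h = 1e-6
numeric = (function_1(p + h) - function_1(p - h)) / (2*h)
print(analytic, numeric)  # both close to 20
```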

**Step 2:** Determine the learning rate. We can set it to 0.1 or any reasonable number based on the function.

**Step 3:** We can calculate the learning steps. We preset *p₀* to be 7. Note that, after generating each new point starting with *p₁*, you need to check Step 4.

\[ p_{1}= p_0 - \gamma \nabla f(p_0) = 7- 0.1 \times (2 \times 7+ 6) = 5 \]

\[ p_{2}= p_1 - \gamma \nabla f(p_1) = 5-0.1 \times (2 \times 5 + 6) = 3.4\]

**Step 4:** We compare the step size with the predetermined tolerance of 0.01. For point 1 at (7, 101), the step size is 2, which is greater than 0.01. Thus, we go back to Step 3 to continue.

\[ \gamma \nabla f(p_n) = 0.1 \times (2 \times 7+ 6) = 2 \]

For point 2 at (5, 65), the step size is 1.6, which is still greater than 0.01. Thus, we go back to Step 3 again.

\[ \gamma \nabla f(p_n) = 0.1 \times (2 \times 5+ 6) = 1.6 \]
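
These two hand calculations can be reproduced in a few lines (same γ = 0.1 and p₀ = 7 as above):

```
gamma, p = 0.1, 7.0
points = []
for _ in range(2):
    step_size = gamma * (2*p + 6)          # step size at the current point
    points.append((p, p**2 + 6*p + 10, step_size))
    p = p - step_size                      # move to the next point
for x, y, s in points:
    print(f"point ({x}, {y}), step size {s}")
```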

## Python Code of Gradient Descent

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Define the function y = x^2 + 6x + 10
def function_1(x):
    return x**2 + 6*x + 10

# Define the gradient descent algorithm
def GD_function(start, learning_rate, max_itr, tol):
    steps_taken = [start]
    x = start
    for i in range(max_itr):
        # gradient of y = x^2 + 6x + 10 is 2x + 6
        step_size = learning_rate*(2*x + 6)
        if np.abs(step_size) < tol:
            break
        x = x - step_size
        steps_taken.append(x)
    return steps_taken

# Use the gradient descent function. Start point is 7, learning rate is 0.1,
# maximum iteration is 1000, tolerance is 0.01
steps_taken = GD_function(7, 0.1, 1000, 0.01)

# build a dataframe holding the function curve (range chosen for illustration)
data_XY = pd.DataFrame({"X": np.linspace(-10, 8, 100)})
data_XY["Y"] = function_1(data_XY["X"])
# save the steps (i.e., x) into the dataframe (shorter column, so unused rows are NaN)
data_XY["Step_X"] = pd.Series(steps_taken)
# calculate the y value of each step (x)
data_XY["Step_Y"] = function_1(data_XY["Step_X"])
# calculate the step size in each iteration
data_XY["Step_size"] = 0.1*(2*data_XY["Step_X"] + 6)
# print out the x value and step size of each step (dropping the unused rows)
print(data_XY[["Step_X", "Step_size"]].dropna())
```

```
      Step_X  Step_size
0   7.000000   2.000000
1   5.000000   1.600000
2   3.400000   1.280000
3   2.120000   1.024000
4   1.096000   0.819200
5   0.276800   0.655360
6  -0.378560   0.524288
7  -0.902848   0.419430
8  -1.322278   0.335544
9  -1.657823   0.268435
10 -1.926258   0.214748
11 -2.141007   0.171799
12 -2.312805   0.137439
13 -2.450244   0.109951
14 -2.560195   0.087961
15 -2.648156   0.070369
16 -2.718525   0.056295
17 -2.774820   0.045036
18 -2.819856   0.036029
19 -2.855885   0.028823
20 -2.884708   0.023058
21 -2.907766   0.018447
22 -2.926213   0.014757
23 -2.940970   0.011806
24 -2.952776   0.009445
```

Now that we have the data, we can use matplotlib to plot it. The code below does that.

```
plt.rcParams['figure.figsize'] = [5, 8]
# plot the function curve in black and the steps taken in red
plt.plot(data_XY["X"], data_XY["Y"], c="Black")
plt.scatter(data_XY["Step_X"], data_XY["Step_Y"], c='Red')
plt.show()
```

The resulting figure shows the steps (x-axis values) as well as the step sizes during the iteration process. The interesting thing is that the descent is very smooth, or monotonic.

## Final Remarks

- `learning rate * gradient` is subtracted from *pₙ* because we want to move against the gradient, toward the local minimum. What does subtracting `learning rate * gradient` actually do? The following chart illustrates the process. The second step is `(5, 65)`. We know that the step size is 1.6 (see the calculation above). Thus, the next step is `(3.4, 41.96)`, where `5 - 1.6 = 3.4` and `3.4^2 + 6*3.4 + 10 = 41.96`. The reason why we do not fix the step size is to incorporate the gradient into the step size: when the gradient is greater, the step size is greater as well (we can clearly see the step sizes change in the figure above).

- Changing the learning rate can change the iteration process dramatically. For instance, if we change it from 0.1 to 0.4, the following is the only line of code that differs from the code shown above. The resulting charts show a much different iteration process; the basic idea is that far fewer points are needed to converge.

`steps_taken = GD_function(7, 0.4, 1000, 0.01)`
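
The faster convergence can be checked without the full dataframe code (a quick sketch of my own, reusing the same update rule):

```
# same update rule as GD_function, with learning rate 0.4 and tolerance 0.01
p, steps = 7.0, [7.0]
while abs(0.4*(2*p + 6)) >= 0.01:
    p = p - 0.4*(2*p + 6)
    steps.append(p)
print(len(steps))  # only 6 points, versus 25 with learning rate 0.1
```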

- If the learning rate is too large, the process might not be monotonic. For instance, if we change the learning rate to 0.9, the following is the iteration process.
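
To see this non-monotonic behavior concretely (again a sketch I added, reusing the same update rule with γ = 0.9):

```
# gradient descent on y = x^2 + 6x + 10 with a large learning rate
p, gamma = 7.0, 0.9
points = [p]
for _ in range(6):
    p = p - gamma*(2*p + 6)
    points.append(p)
print(points)  # the iterates jump back and forth across x = -3
```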

- `Gradient Descent (GD)` needs the function to be convex. Thus, if the learning rate is appropriate, we can get a monotonic sequence, and in this case the convergence to the local minimum is guaranteed. However, an inappropriate learning rate may lead to something very different, or the process may even fail to converge.

\[ f(p_0) > f(p_1) > f(p_2) > f(p_3) … \]

- Even without gradient descent and Python code, we know that the lowest point of the parabola is `(-3, 1)`, based on the following simple math transformation. However, the computer does not have such knowledge; instead, it can use a numerical method to find the minimum. Also, note that in machine learning, gradient descent is used to estimate a model by minimizing a cost function, so that is slightly different from the situation discussed in this post. I am still working on a post showing how to use gradient descent for a linear regression model, and will publish it soon.

\[ \ y=x^2+6x+10 = (x+3)^2+1 \]
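
This can be double-checked with the standard vertex formula (simple arithmetic, not from the post itself):

```
# vertex of y = a*x^2 + b*x + c is at x = -b / (2a)
a, b, c = 1, 6, 10
x_min = -b / (2*a)
y_min = a*x_min**2 + b*x_min + c
print(x_min, y_min)  # -3.0 1.0
```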

## Other Resources

For a discussion about learning rates, you can refer to a post by Jeremy Jordan.

I also have another tutorial about gradients in machine learning, which you can check out.