# What is Gradient Descent in Machine Learning? Definition and Python Example

Gradient Descent (GD) is an optimization algorithm using gradient to help find a local minimum of a function. (For more details on gradients, please refer to my other tutorial.) When the function is convex, a local minimum is also the global minimum.

## Steps of Gradient Descent Algorithm

Step 1: Calculate the gradient function. In simple terms, it is the partial derivative of a given function.

$\nabla f(p)=\left[ \begin{array} {} \frac{\partial f}{\partial x} f(p) \\ \frac{\partial f}{\partial y} f(p) \\..\\\frac{\partial f}{\partial w} f(p) \end{array} \right]$

Step 2: Determine the learning rate. The learning rate multiplies with the gradient to determine the step size in the iteration.

$Learning \ Rate : \gamma$

$Step \ Size : \gamma \nabla f(p_n)$

Step 3: Calculate a new step, which is the last step being subtracted with the step size. Note that, for the very first step, p0 needs to be guessed and provided as part of the input.

$p_{1}= p_0 – \gamma \nabla f(p_0)$

$p_{2}= p_1 – \gamma \nabla f(p_1)$

$…$

$p_{n+1}= p_n – \gamma \nabla f(p_n)$

Step 4: Keep iterating till the absolute value of step size is smaller than a certain preset tolerance value (tol).

$abs(\gamma \nabla f(p_n)) < tol$

For instance, for a function with a single independent variable.

$\ y=x^2+6x+10$

Step 1: We calculate the gradient of the function.

$\ \frac{dy}{dx}=2x+6$

The gradient can be also written as follows.

$\nabla f(p)=\left[ \begin{array} {} 2p+6 \end{array} \right]$

Step 2: Determine the learning rate. We can set it as 0.1 or any reasonable number based on the function.

Step 3: We can calculate the learning steps. We can preset the p0 to be 7. Note that, after generating p1, in each step here, you need to check step 4.

$p_{1}= p_0 – \gamma \nabla f(p_0) = 7- 0.1 \times (2 \times 7+ 6) = 5$

$p_{2}= p_1 – \gamma \nabla f(p_1) = 5-0.1 \times (2 \times 5 + 6) = 3.4$

Step 4: We compare step size with the predetermined value of 0.01. For point 1 of (7, 101), the step size is 2, which is greater than 0.01. Thus, we back to step 3 to continue.

$\gamma \nabla f(p_n) = 0.1 \times (2 \times 7+ 6) = 2$

For point 2 (5, 65), the step size is 1.6, which is greater than 0.01. Thus, we back to step 3 again.

$\gamma \nabla f(p_n) = 0.1 \times (2 \times 5+ 6) = 1.6$

## Python Code of Gradient Descent

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Define the gradient descent algorithm
def GD_function(start, learning_rate, max_itr, tol):
steps_taken = [start]
x = start

for i in range(max_itr):
step_size = learning_rate*(2*x+6)
if np.abs(step_size)<tol:
break
x = x - step_size
steps_taken.append(x)

return steps_taken

# Use the gradient descent function. Start point is 7, learning rate is 0.1
# maximum iteration is 1000, tolerance is 0.01
steps_taken = GD_function(7, 0.1, 1000, 0.01)

# save the steps (i.e., x) into a dataframe
data_XY["Step_X"] =pd.DataFrame({"Step_X":steps_taken})

# calculate the y value of each step (x)
data_XY["Step_Y"] =function_1(data_XY["Step_X"])

# calculate the step size in each iteration
data_XY["Step_size"] =0.1*(2*data_XY["Step_X"]+6)

# print out X value and step size for each step
print(data_XY[["Step_X","Step_size"]] )
      Step_X  Step_size
0   7.000000   2.000000
1   5.000000   1.600000
2   3.400000   1.280000
3   2.120000   1.024000
4   1.096000   0.819200
5   0.276800   0.655360
6  -0.378560   0.524288
7  -0.902848   0.419430
8  -1.322278   0.335544
9  -1.657823   0.268435
10 -1.926258   0.214748
11 -2.141007   0.171799
12 -2.312805   0.137439
13 -2.450244   0.109951
14 -2.560195   0.087961
15 -2.648156   0.070369
16 -2.718525   0.056295
17 -2.774820   0.045036
18 -2.819856   0.036029
19 -2.855885   0.028823
20 -2.884708   0.023058
21 -2.907766   0.018447
22 -2.926213   0.014757
23 -2.940970   0.011806
24 -2.952776   0.009445

Since we got the data, we can use matplotlib to plot it. The code below does that.

plt.rcParams['figure.figsize'] = [5, 8]
plt.plot(data_XY["X"], data_XY["Y"],c="Black")
plt.scatter(data_XY["Step_X"],data_XY["Step_Y"],c='Red')

The following shows the steps (x-axis values) as well as step sizes in the iteration process. The interesting thing is that, it is very smooth, or monotonic.

## Final Remarks

• learning rate * gradient  is subtracted from pn because we want to move against the gradient toward the local minimum. What does it mean by doing learning rate * gradient ? The following chart illustrates the process. The second step is (5, 65). We know that the step size is 1.6 (see the calculation above). Thus, the next step is (3.4, 41.96), where 5-1.6=3.4 and 3.42 +6*3.4+10 =41.96. The reason why we do not fix a step size is to incoporate the gradient into the consideration of the step size: When gradient is greater, the step size is greater as well (we can see the step size changes in the figure above clearly).
• Change step size can change the iteration process dramatically. For instance, if we change from 0.1 to 0.4, the following is the only line of code that is different from the code shown above. The following charts show a much different iteratio nprocess. The basic idea is that it needs much fewer points to converge.
steps_taken = GD_function(7, 0.4, 1000, 0.01)
• If the learning rate is too large, the process might not be monotonic. For instance, if we change the learning rate to 0.9, the following is the iteration process.

• Gradient Descent (GD) needs the function to be convex. Thus, if the learning rate is appropriate, we can get a monotonic sequence. In this case, the convergence of finding the local minimum is guaranteed. However, an inapproprate learning rate may lead to something very different, or even not being able to converge.

$f(p_0) > f(p_1) > f(p_2) > f(p_3) …$

• Even without gradient descent and Python code, we know that the lowest point of the bell shape is (-3, 1), based on the following simple math transformation. However, the computer does not have such knowledge. Instead, the computer can use a numerical method to find the minimum. Also, note that, gradient descent is used to estimate model based on cost function in machine learning. Thus, it is slightly different from the situation discussed in this post. I am still working on the post showing how to use gradient descent for linear regression model, and will post it soon.
$\ y=x^2+6x+10 = (x+3)^2+1$