## 1. Introduction to the Cost Function in Machine Learning

Linear regression in machine learning via gradient descent can be used to estimate the slope (*b*_{1}) and intercept (*b*_{0}) of a linear regression model. The criterion for selecting the right *b*_{0} and *b*_{1} is to minimize the difference between the estimated y and the observed y.

We can write the criterion for minimizing the difference as follows, which is called the cost function in the machine learning context.

\[ C=\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]

We can write out the predicted y as follows.

\[ \hat{y_i} = b_0 +b_1 x_i \]

Thus, the cost function can be rewritten as follows.

\[ C=\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i-b_0 -b_1 x_i)^2 \]
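To make the formula concrete, here is a minimal sketch that evaluates C on a tiny made-up dataset. The values of `x` and `y` and the function name `mse_cost` are illustrative assumptions, not part of the advertising data used later. It is written in plain Python so that each term of the sum is visible.

```python
# Hypothetical toy data generated from y = 1 + 2x (no noise)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

def mse_cost(x, y, b_0, b_1):
    """Mean squared difference between observed y and b_0 + b_1*x."""
    n = len(x)
    return sum((y_i - (b_0 + b_1 * x_i)) ** 2 for x_i, y_i in zip(x, y)) / n

cost_at_zero = mse_cost(x, y, b_0=0.0, b_1=0.0)   # a poor guess
cost_at_truth = mse_cost(x, y, b_0=1.0, b_1=2.0)  # the generating line
print(cost_at_zero, cost_at_truth)
```

On this toy data the cost is 41.0 when both coefficients are zero and 0.0 at the generating line, which is the behavior the minimization criterion relies on.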

## 2. Iteration Process in Machine Learning for Linear Regression

Given the cost function, we can calculate its partial derivatives with respect to b_0 and b_1 as follows.

\[ \frac{\partial C}{\partial b_0}= \frac{-2}{n} \sum_{i=1}^{n} (y_i-b_0 -b_1 x_i) \]

\[ \frac{\partial C}{\partial b_1}= \frac{-2}{n} \sum_{i=1}^{n} (y_i-b_0 -b_1 x_i) x_i \]
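One way to gain confidence in these derivatives is to compare them with finite-difference approximations of the cost. The sketch below does this on a made-up dataset; the data, the evaluation point (0.5, 0.5), and the helper names `cost`, `gradient`, and `numeric_gradient` are assumptions for illustration.

```python
# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

def cost(b_0, b_1):
    n = len(x)
    return sum((y_i - b_0 - b_1 * x_i) ** 2 for x_i, y_i in zip(x, y)) / n

def gradient(b_0, b_1):
    """Analytic partial derivatives of the cost with respect to b_0 and b_1."""
    n = len(x)
    residuals = [y_i - b_0 - b_1 * x_i for x_i, y_i in zip(x, y)]
    d_b0 = -2.0 / n * sum(residuals)
    d_b1 = -2.0 / n * sum(r * x_i for r, x_i in zip(residuals, x))
    return d_b0, d_b1

def numeric_gradient(b_0, b_1, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    d_b0 = (cost(b_0 + h, b_1) - cost(b_0 - h, b_1)) / (2 * h)
    d_b1 = (cost(b_0, b_1 + h) - cost(b_0, b_1 - h)) / (2 * h)
    return d_b0, d_b1

analytic = gradient(0.5, 0.5)
numeric = numeric_gradient(0.5, 0.5)
print(analytic, numeric)
```

The two pairs should agree up to small floating-point error. A mismatch would point to a sign or indexing mistake in the analytic formulas.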

The algorithm iteratively calculates the next point: it takes the gradient at the current position and multiplies it by a learning rate, **η**, which controls the step size of the move to the next point.

Then, it subtracts the obtained value (i.e., gradient × learning rate) from the current position. The process of moving from the current position (**n**) to the new position (**n+1**) is called making a step. Note that the subscript (n) counts iterations here and is distinct from the sample size n in the cost function. This process can be written as follows.

\[ b_{0 (n+1)} = b_{0 (n)} - \eta \frac{\partial C}{\partial b_{0 (n)}} \]

\[ b_{1 (n+1)} = b_{1 (n)} - \eta \frac{\partial C}{\partial b_{1 (n)}} \]

Substituting the partial derivatives, they can be rewritten as follows.

\[ b_{0 (n+1)} = b_{0 (n)} - \eta \frac{\partial C}{\partial b_{0 (n)}}= b_{0 (n)} - \eta \left( \frac{-2}{n} \sum_{i=1}^{n} (y_i-b_{0 (n)} -b_{1 (n)} x_i) \right) \]

\[ b_{1 (n+1)} = b_{1 (n)} - \eta \frac{\partial C}{\partial b_{1 (n)}}= b_{1 (n)} - \eta \left( \frac{-2}{n} \sum_{i=1}^{n} (y_i-b_{0 (n)} -b_{1 (n)} x_i) x_i \right) \]
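As a worked illustration of a single step, assume a made-up dataset of four points, (x, y) = (1, 3), (2, 5), (3, 7), (4, 9), starting values b_{0 (0)} = 0 and b_{1 (0)} = 0, and η = 0.01. All of these numbers are assumptions chosen for easy arithmetic. With both coefficients at zero, each residual equals y_i, so the partial derivatives are:

\[ \frac{\partial C}{\partial b_{0 (0)}} = \frac{-2}{4} (3+5+7+9) = -12, \qquad \frac{\partial C}{\partial b_{1 (0)}} = \frac{-2}{4} (3 \cdot 1+5 \cdot 2+7 \cdot 3+9 \cdot 4) = -35 \]

The first step then gives:

\[ b_{0 (1)} = 0 - 0.01 \times (-12) = 0.12, \qquad b_{1 (1)} = 0 - 0.01 \times (-35) = 0.35 \]

Both coefficients move toward the line y = 1 + 2x that generated these points.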

## 3. Python Code for Linear Regression in Machine Learning

```
# import numpy
import numpy as np

# perform one gradient descent update of b_0 and b_1
def updating_steps(x, y, b_1, b_0, learning_rate):
    n_number = len(x)
    # predictions under the current coefficients
    y_predicted = b_0 + b_1*x
    # sums inside the partial derivatives (divided by n below)
    b0_deriv = -2*np.sum(y - y_predicted)
    b1_deriv = -2*np.dot((y - y_predicted), x)
    # step against the gradient, scaled by the learning rate
    b_1 -= (b1_deriv/n_number)*learning_rate
    b_0 -= (b0_deriv/n_number)*learning_rate
    return(b_0, b_1)

# iteration process of finding the coefficients
def prediction(x, y, b_1, b_0, learning_rate, iters):
    b_0_history = []
    b_1_history = []
    for i in range(iters):
        b_0, b_1 = updating_steps(x, y, b_1, b_0, learning_rate)
        b_0_history.append(b_0)
        b_1_history.append(b_1)
        # report progress every 100 iterations
        if i % 100 == 0:
            print(i, "b_0=", b_0, "b_1=", b_1)
    return(b_0_history, b_1_history)
```

We can then download the data and apply the prediction function. We are going to use the following model, which uses the `radio` column to predict `sales`.

*sales = b_{0} + b_{1} radio*

Thus, we use the following Python code to estimate b_{0} and b_{1}. Note that this dataset is originally from the book *An Introduction to Statistical Learning*.

```
# download data from Github
import pandas as pd
df_train=pd.read_csv("https://raw.githubusercontent.com/TidyPython/machine_learning/main/Advertising.csv")
# apply the prediction function
b_0_history,b_1_history=prediction(x=df_train['radio'], y=df_train['sales'], b_1=0,b_0=4, learning_rate=0.001,iters=10000)
```

The following is the partial output:

0 b_0= 4.020045 b_1= 0.5551519
100 b_0= 4.310919241910213 b_1= 0.3555198433855029
200 b_0= 4.591006211887993 b_1= 0.3469490665522249
300 b_0= 4.85540569511566 b_1= 0.33885833328548937
... [output omitted to save space] ...
9600 b_0= 9.290698724706752 b_1= 0.2031365367669492
9700 b_0= 9.291871525057088 b_1= 0.2031006485923814
9800 b_0= 9.292978637632094 b_1= 0.20306677049083868
9900 b_0= 9.294023741560824 b_1= 0.20303478987945422
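The run above uses `learning_rate=0.001`. As a hedged illustration of why the step size matters, the sketch below applies the same update rule to a small made-up dataset with two learning rates. The data, the helper names `cost` and `run_gradient_descent`, and the specific rates are assumptions for illustration; suitable values for the advertising data will differ.

```python
# Hypothetical toy data generated from y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
n = len(x)

def cost(b_0, b_1):
    return sum((y_i - b_0 - b_1 * x_i) ** 2 for x_i, y_i in zip(x, y)) / n

def run_gradient_descent(learning_rate, iters, b_0=0.0, b_1=0.0):
    """Apply the update rule `iters` times and return the final coefficients."""
    for _ in range(iters):
        residuals = [y_i - b_0 - b_1 * x_i for x_i, y_i in zip(x, y)]
        d_b0 = -2.0 / n * sum(residuals)
        d_b1 = -2.0 / n * sum(r * x_i for r, x_i in zip(residuals, x))
        # both coefficients are updated from the same residuals
        b_0 -= learning_rate * d_b0
        b_1 -= learning_rate * d_b1
    return b_0, b_1

initial_cost = cost(0.0, 0.0)
small_step_cost = cost(*run_gradient_descent(learning_rate=0.01, iters=20))
large_step_cost = cost(*run_gradient_descent(learning_rate=0.2, iters=20))
print(initial_cost, small_step_cost, large_step_cost)
```

On this toy data the smaller rate lowers the cost after 20 steps, while the larger rate overshoots the minimum so the cost grows instead of shrinking.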

## 4. Plot Iteration Process for Linear Regression in Machine Learning

```
# import matplotlib
from matplotlib import pyplot as plt
# set the size of the figure
plt.rcParams['figure.figsize'] = [10, 6]
# plot the iteration process for b_1
plt.plot(b_1_history)
```

Output: [Figure: b_{1} across the iterations]

```
# plot the iteration process for b_0
plt.plot(b_0_history)
```

Output: [Figure: b_{0} across the iterations]

## 5. Conclusion and Comparison to Ordinary Least Squares (OLS)

From the printed output, b_{0} reaches about 9.294 and b_{1} about 0.203 by iteration 9,900. We can write the model statement below for linear regression using gradient descent.

*sales = b_{0} + b_{1} radio* = 9.294 + 0.203 radio

We can compare it to the Ordinary Least Squares (OLS) approach. Below, we use the least-squares function in NumPy, `np.linalg.lstsq`, to calculate the regression coefficients.

```
# add a row of 1s (for the intercept) above the radio values, then transpose
x_array=np.array([np.ones(len(df_train)),df_train['radio']])
x_array=x_array.T
# use the OLS function in Numpy
results_1=np.linalg.lstsq(x_array, df_train['sales'], rcond=None)[0]
# print out the result
print(results_1)
```

Output:

[9.3116381 0.20249578]

Thus, OLS gives b_{0} = 9.312 and b_{1} = 0.202. These values are close to the gradient descent estimates, although b_{0} from gradient descent is still slightly below the OLS value after 10,000 iterations.
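For a single predictor, the OLS coefficients also have a well-known closed form: b_{1} is the sum of (x_i - mean of x)(y_i - mean of y) divided by the sum of (x_i - mean of x)^2, and b_{0} is the mean of y minus b_{1} times the mean of x. The sketch below checks, on a small made-up dataset (not the advertising data), that gradient descent run long enough lands on the same values. The data, learning rate, and iteration count are illustrative assumptions.

```python
# Hypothetical toy data, roughly following y = 1 + 2x with small deviations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]
n = len(x)

# closed-form OLS for one predictor
x_mean = sum(x) / n
y_mean = sum(y) / n
b_1_ols = (sum((x_i - x_mean) * (y_i - y_mean) for x_i, y_i in zip(x, y))
           / sum((x_i - x_mean) ** 2 for x_i in x))
b_0_ols = y_mean - b_1_ols * x_mean

# gradient descent on the same data, using the update rule from Section 2
b_0, b_1 = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    residuals = [y_i - b_0 - b_1 * x_i for x_i, y_i in zip(x, y)]
    d_b0 = -2.0 / n * sum(residuals)
    d_b1 = -2.0 / n * sum(r * x_i for r, x_i in zip(residuals, x))
    b_0 -= learning_rate * d_b0
    b_1 -= learning_rate * d_b1

print("OLS:", b_0_ols, b_1_ols)
print("GD: ", b_0, b_1)
```

With these assumed values the two approaches agree to many decimal places, mirroring the comparison above.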