Tutorial of Data Visualization Using Python

Introduction

This tutorial shows how to plot line charts, bar charts, and scatter plots in Python. The main packages used are Pandas, Matplotlib, and Seaborn. Note that the Pandas plot functions and Seaborn are built on top of Matplotlib, so you can apply Matplotlib functions to the charts they produce. In some situations, I find it easier to draw the plot with Pandas or Seaborn first and then use Matplotlib to modify some parameters of the chart.
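As a minimal sketch of that workflow, the example below draws a chart with Pandas and then adjusts it with Matplotlib. The tiny data frame and its column names (x and y) are made up for illustration and are not part of the tutorial's data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, used only to illustrate the workflow.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 12, 9, 15]})

# Step 1: let Pandas draw the chart. DataFrame.plot returns a Matplotlib Axes.
ax = toy.plot(x="x", y="y")

# Step 2: adjust the same chart with plain Matplotlib calls.
ax.set_title("Toy example", fontsize=10)
ax.set_ylabel("y value", fontsize=10)

# plt.gca() returns the current Axes, which is the one Pandas just drew on.
same_axes = plt.gca() is ax
```

Add plt.show() at the end to display the figure. Because same_axes is True here, calls such as plt.title() and plt.xlabel() act on the chart that Pandas created.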


Line Charts

Line charts are typically used to show an overall trend of a certain topic. For instance, you can use a line chart to show the overall price movement of a stock or people’s interest in a certain topic or object.

In particular, I am going to use Peloton’s Google Trends data to plot the overall trend of consumers’ interest in Peloton. The code below reads the CSV data file into the environment and prints it out. (To learn more about dataframes, you can refer to this post.)

import pandas as pd
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
print(data_Peloton)
          Week  Peleton
0    3/10/2019       10
1    3/17/2019       11
2    3/24/2019       10
3    3/31/2019        9
4     4/7/2019        8
..         ...      ...
156   3/6/2022       19
157  3/13/2022       17
158  3/20/2022       16
159  3/27/2022       15
160   4/3/2022       15

[161 rows x 2 columns]

We can use Matplotlib’s plot function to draw this line chart. The basic structure of the plot function is as follows: you specify the variable for the x-axis, the variable for the y-axis, and the data object that holds both.

plt.plot('xlabel', 'ylabel', data=obj)

Adding that line gives the simplest version of the complete code.

import pandas as pd

import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.plot('Week', 'Peloton',data=data_Peloton)
plt.show()

The following is the figure output. As you can see, while the y-axis looks great, the x-axis looks very crowded. To solve that problem, we need to reduce the number of labels shown on the x-axis.

Line Chart of Google Trends of Peloton using Python Matplotlib
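The plot call above passes the column names as strings together with data=. As a small sketch, the code below compares that form with passing the columns themselves. It uses only the first five rows of the printed output, typed in by hand, with the column name Peloton as used in the code above; the variable names are mine.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the Google Trends data printed earlier in this tutorial.
sample = pd.DataFrame({
    "Week": ["3/10/2019", "3/17/2019", "3/24/2019", "3/31/2019", "4/7/2019"],
    "Peloton": [10, 11, 10, 9, 8],
})

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Form 1: column names as strings plus the data= argument.
line_a, = left.plot("Week", "Peloton", data=sample)

# Form 2: pass the columns themselves.
line_b, = right.plot(sample["Week"], sample["Peloton"])

# Both calls draw the same y values.
same_values = list(line_a.get_ydata()) == list(line_b.get_ydata())
```

Which form to use is mostly a matter of taste. The string form keeps the call short when both columns come from the same data frame.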

Add X-axis Intervals and Title

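There are quite a few different methods of specifying the interval between x-axis labels. Here, I use MultipleLocator(30), which places a label at every 30th week. The second line adds a title and sets its font size. See the figure below.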

plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
Scale Intervals in Python Matplotlib
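MultipleLocator is only one of those methods. As a hedged sketch of another, the code below sets the tick positions by hand. It relies on the fact that Matplotlib places string labels at positions 0, 1, 2, and so on. The data frame demo is a stand-in that I made up: it has 161 weekly labels like the real file (formatted with leading zeros), but the values are invented so that the example runs without a download.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the 161-row Google Trends data: 161 weekly labels
# and made-up values, so the sketch runs without downloading the CSV file.
weeks = pd.date_range("2019-03-10", periods=161, freq="7D").strftime("%m/%d/%Y")
demo = pd.DataFrame({"Week": weeks, "Peloton": [10 + i % 7 for i in range(161)]})

fig, ax = plt.subplots()
ax.plot("Week", "Peloton", data=demo)

# String labels sit at positions 0, 1, 2, ...; keep every 30th position.
positions = list(range(0, len(demo), 30))
ax.set_xticks(positions)
ax.set_xticklabels(demo["Week"].iloc[positions].tolist(), rotation=45)
```

Setting the ticks by hand takes more typing than MultipleLocator(30), but it lets you choose exactly which labels appear and rotate them if they still overlap.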

Below is the complete version of the updated Python code.

import pandas as pd

import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.plot('Week', 'Peloton',data=data_Peloton)
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
plt.show()
Line Chart of Google Trends of Peloton using Python Matplotlib

Bar Charts

Bar charts can often do a similar job to line charts; in other words, the two are frequently interchangeable. However, when there are multiple variables on the Y-axis, a bar chart can get crowded, and a line chart may be the better choice. The examples below show what I mean.

Single Y Variable using Bar Charts

import pandas as pd
import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.bar('Week', 'Peloton',data=data_Peloton)
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
plt.show()

The following is the chart. As we can see, it is not much different from the line chart shown above. Note that this is the case where there is only one variable on the Y-axis. In some cases, however, you might have multiple variables on the Y-axis, as shown next.

Bar Chart of Google Trends of Peloton using Python matplotlib

Multiple Y Variables using Bar Charts

The data set below includes multiple columns of data, namely RD Expenses, Sales and Marketing, and General Admin Expenses. We can plot all three on the Y-axis of the same bar chart, with the Quarter column on the X-axis.

import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
print(data_MSFT_T)
   Quarter RD Expenses Sales and Marketing General Admin Expenses
1   2017Q1        3355                3879                   1202
2   2017Q2        3514                4356                   1355
3   2017Q3        3574                3812                   1166
4   2017Q4        3504                4562                   1109
5   2018Q1        3715                4335                   1208
6   2018Q2        3933                4760                   1271
7   2018Q3        3977                4098                   1149
8   2018Q4        4070                4588                   1132
9   2019Q1        4316                4565                   1179
10  2019Q2        4513                4962                   1425
11  2019Q3        4565                4337                   1061
12  2019Q4        4603                4933                   1121
13  2020Q1        4887                4911                   1273
14  2020Q2        5214                5417                   1656
15  2020Q3        4926                4231                   1119
16  2020Q4        4899                4947                   1139
17  2021Q1        5204                5082                   1327
18  2021Q2        5687                5857                   1522
19  2021Q3        5599                4547                   1287
20  2021Q4        5758                5379                   1384

The following is the complete Python code and figure output. When looking at the code, notice that the line below does not use plt, the interface that comes directly from Matplotlib. Instead, it uses the Pandas function pandas.DataFrame.plot, which is built on top of Matplotlib. That is why you still need to include plt.show() at the end to display the figure.

data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
plt.show()
Bar Chart of Microsoft Operating Expenses using Python Pandas and matplotlib

We can change the font size of the x-axis and legend using the functions from matplotlib. Finally, similar to the line chart above, we can also add a title and set its font size. Below is the updated version of the complete code and figure output.

import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=8) 
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
Bar Chart of Microsoft Operating Expenses using Python Pandas and matplotlib
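The stacked argument in the calls above is set to False, which draws the three expense columns side by side. As a rough sketch of the alternative, the code below sets stacked=True on the first four quarters from the table, typed in by hand so that it runs without downloading the file. The variable names sample, top_segment, and first_quarter_total are mine.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First four quarters from the table printed above.
sample = pd.DataFrame({
    "Quarter": ["2017Q1", "2017Q2", "2017Q3", "2017Q4"],
    "RD Expenses": [3355, 3514, 3574, 3504],
    "Sales and Marketing": [3879, 4356, 3812, 4562],
    "General Admin Expenses": [1202, 1355, 1166, 1109],
})

# stacked=True piles the three expense columns on top of each other,
# so each quarter gets one bar instead of three.
ax = sample.plot(x="Quarter", kind="bar", stacked=True)
ax.set_xlabel("Quarters", fontsize=10)
ax.legend(fontsize=8)
ax.set_title("Microsoft Operating Expenses (stacked)", fontsize=10)

# The top of the last segment equals the row total for that quarter.
top_segment = ax.containers[-1].patches[0]
first_quarter_total = top_segment.get_y() + top_segment.get_height()
```

A stacked chart has fewer bars, so it is less crowded, and it shows the total of the three columns for each quarter. The trade-off is that the individual columns become harder to compare with each other.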

Compare Bar Charts versus Line Charts

As mentioned, the bar chart above is a bit crowded, so we can use a line chart instead. The Pandas plot function draws line charts by default, so we can just use df.plot() to plot a line chart.

The kind of plot to produce:
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot (DataFrame only)
‘hexbin’ : hexbin plot (DataFrame only)
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot()
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=10)
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
Line Chart of Microsoft Operating Expenses using Python Pandas and matplotlib

Improved Version

You might notice a problem with the X-axis: the scale does not use the data from the Quarter column. This is because the plot function uses the row index for the X-axis by default. To fix it, we can use set_index to make the Quarter column the row index of the dataframe. Below is the improved Python code. Personally, I prefer line charts when there is more than one Y variable, since they look less crowded.

import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T=data_MSFT_T.set_index('Quarter')
data_MSFT_T.plot()
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=10)
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
Line Chart of Microsoft Operating Expenses using Python Pandas and matplotlib
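Setting the index is not the only way to put the quarters on the X-axis. As a short sketch, the code below compares set_index with naming the x column in the plot call, as the bar chart examples did. It uses the first four quarters from the table, typed in by hand; the variable names are mine.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First four quarters from the table printed earlier.
sample = pd.DataFrame({
    "Quarter": ["2017Q1", "2017Q2", "2017Q3", "2017Q4"],
    "RD Expenses": [3355, 3514, 3574, 3504],
    "Sales and Marketing": [3879, 4356, 3812, 4562],
    "General Admin Expenses": [1202, 1355, 1166, 1109],
})

# Option 1 (used above): move Quarter into the row index, then plot.
ax_index = sample.set_index("Quarter").plot()

# Option 2: leave the data frame unchanged and name the x column in the call.
ax_column = sample.plot(x="Quarter")

# Both charts contain one line per expense column.
labels_index = [line.get_label() for line in ax_index.get_lines()]
labels_column = [line.get_label() for line in ax_column.get_lines()]
```

Both options should produce the same three lines. The second one leaves the original data frame untouched, which can be handy if you still need Quarter as a regular column later.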

Scatter Plot

Use Pandas for Scatter Plot

A scatter plot is used to show the relationship between two variables. For instance, if you want to understand how people’s concerns about Covid are related to their interest in buying a Peloton, you can draw a scatter plot to see whether the relationship trends upward, trends downward, or is flat. The following is the data being used, again from Google Trends.

import pandas as pd
import matplotlib.pyplot as plt
data_Covid_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
print(data_Covid_Peloton)
          Week  Peloton  Covid
0    2/23/2020       26      2
1     3/1/2020       24      5
2     3/8/2020       24     20
3    3/15/2020       67     53
4    3/22/2020       69     62
..         ...      ...    ...
106   3/6/2022       32     16
107  3/13/2022       29     16
108  3/20/2022       27     14
109  3/27/2022       25     14
110   4/3/2022       25     14

[111 rows x 3 columns]
data_Covid_Peloton=data_Covid_Peloton.set_index('Week')
data_Covid_Peloton.plot(kind="scatter",x="Covid",y="Peloton")
plt.xlabel('Keyword of Covid', fontsize=10)
plt.ylabel('Keyword of Peloton', fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()
Scatter Plot of Relationship of Covid and Peloton based on Google Trends using Python Pandas and matplotlib
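A scatter plot shows the direction of a relationship visually. If you also want a number for it, one common option is the Pearson correlation. The sketch below is my own addition and uses only the ten rows visible in the printed output above, not the full 111 rows, so the result is illustrative only.

```python
import pandas as pd

# The ten rows visible in the printed output above (not the full 111 rows).
sample = pd.DataFrame({
    "Week": ["2/23/2020", "3/1/2020", "3/8/2020", "3/15/2020", "3/22/2020",
             "3/6/2022", "3/13/2022", "3/20/2022", "3/27/2022", "4/3/2022"],
    "Peloton": [26, 24, 24, 67, 69, 32, 29, 27, 25, 25],
    "Covid": [2, 5, 20, 53, 62, 16, 16, 14, 14, 14],
})

# Pearson correlation: close to +1 means an upward trend, close to -1 a
# downward trend, and close to 0 a flat (no linear) relationship.
r = sample["Covid"].corr(sample["Peloton"])
print(f"Pearson r on the sample rows: {r:.2f}")
```

To run this on the full data, replace sample with the data_Covid_Peloton data frame loaded above.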

Use Seaborn for Scatter Plot

Similar to Pandas, Seaborn also uses Matplotlib as the underlying package. Seaborn provides a higher-level interface for drawing statistical graphics. Compared to Pandas, the nice thing about Seaborn is that it can easily add a regression line to the plot. As you can see, you can still use plt from Matplotlib to set the labels on the X and Y axes and their font size.

sns.lmplot(x="Covid",y="Peloton",data=data_Covid_Peloton,fit_reg=True)
import seaborn as sns
sns.lmplot(x="Covid",y="Peloton",data=data_Covid_Peloton,fit_reg=True)
plt.xlabel('Keyword of Covid', fontsize=10)
plt.ylabel('Keyword of Peloton', fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()
Scatter Plot of Relationship of Covid and Peloton based on Google Trends using Python Seaborn and matplotlib
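To get a feel for what the regression line represents, the sketch below fits a straight line by ordinary least squares with NumPy and draws it over a plain Matplotlib scatter plot. This is not Seaborn's internal code, and it leaves out the shaded confidence band that lmplot draws. It uses only the ten rows visible in the printed output earlier, so the fitted numbers are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np

# The ten (Covid, Peloton) pairs visible in the printed output earlier.
covid = np.array([2, 5, 20, 53, 62, 16, 16, 14, 14, 14])
peloton = np.array([26, 24, 24, 67, 69, 32, 29, 27, 25, 25])

# Fit Peloton = slope * Covid + intercept by ordinary least squares.
slope, intercept = np.polyfit(covid, peloton, deg=1)

fig, ax = plt.subplots()
ax.scatter(covid, peloton)

# Draw the fitted line between the smallest and largest Covid values.
x_line = np.array([covid.min(), covid.max()])
ax.plot(x_line, slope * x_line + intercept, color="red")
ax.set_xlabel("Keyword of Covid", fontsize=10)
ax.set_ylabel("Keyword of Peloton", fontsize=10)
```

A positive slope corresponds to the upward trend in the scatter plot. Add plt.show() at the end to display the figure.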

Compare Scatter Plot versus Line Charts

The Covid and Peloton example above involves time points, so we can also use line charts to show the relationship. Comparing the scatter plot with the line chart gives you a better idea of the differences and connections between the two.

import pandas as pd
import matplotlib.pyplot as plt
data_Covid_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
data_Covid_Peloton=data_Covid_Peloton.set_index('Week')
data_Covid_Peloton.plot()
plt.legend(fontsize=10)
plt.xlabel('Week',fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()
Line Chart of Relationship of Covid and Peloton based on Google Trends using Python Pandas and matplotlib
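In the code above, the Week column is still plain text. As an optional sketch, the code below parses it into real dates with pd.to_datetime, so that Matplotlib can space the points by calendar time and label the axis by month. It uses only the ten rows visible in the printed output earlier, which leaves a two-year gap in the middle that the line simply bridges. The six-month tick interval and the label format are my own choices.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows visible in the printed output earlier (not the full 111 rows).
sample = pd.DataFrame({
    "Week": ["2/23/2020", "3/1/2020", "3/8/2020", "3/15/2020", "3/22/2020",
             "3/6/2022", "3/13/2022", "3/20/2022", "3/27/2022", "4/3/2022"],
    "Peloton": [26, 24, 24, 67, 69, 32, 29, 27, 25, 25],
    "Covid": [2, 5, 20, 53, 62, 16, 16, 14, 14, 14],
})

# Parse the month/day/year strings into real timestamps.
sample["Week"] = pd.to_datetime(sample["Week"], format="%m/%d/%Y")
sample = sample.set_index("Week")

fig, ax = plt.subplots()
ax.plot(sample.index, sample["Peloton"], label="Peloton")
ax.plot(sample.index, sample["Covid"], label="Covid")

# One tick every six months, labeled like "Mar 2020".
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
ax.legend(fontsize=10)
```

Add plt.show() at the end to display the figure. With real dates on the axis, the labels no longer need to be thinned out by position as in the first line chart.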

