Introduction
This tutorial shows how to plot line charts, bar charts, and scatter plots in Python using Pandas, Matplotlib, and Seaborn. Note that the Pandas plotting functions and Seaborn are built on top of Matplotlib, so you can combine them with functions from Matplotlib. In some situations, I find it easier to create the plot with Pandas or Seaborn first and then use Matplotlib to adjust some of the chart’s parameters.
Line Charts
Line charts are typically used to show an overall trend of a certain topic. For instance, you can use a line chart to show the overall price movement of a stock or people’s interest in a certain topic or object.
In particular, I am going to use Peloton’s Google Trends data to plot the overall trend of consumers’ interest in Peloton. The code below reads the CSV data file into the environment and prints it out. (To learn more about dataframes, you can refer to this post.)
import pandas as pd
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
print(data_Peloton)
          Week  Peleton
0    3/10/2019       10
1    3/17/2019       11
2    3/24/2019       10
3    3/31/2019        9
4     4/7/2019        8
..         ...      ...
156   3/6/2022       19
157  3/13/2022       17
158  3/20/2022       16
159  3/27/2022       15
160   4/3/2022       15

[161 rows x 2 columns]
We can use matplotlib’s plot function to plot this line chart. The basic structure of the plot function is as follows: you specify the column to put on the x-axis, the column to put on the y-axis, and the object that holds the data.
plt.plot('xlabel', 'ylabel', data=obj)
After adding the line of code above, the following is the simplest version of the complete code.
import pandas as pd
import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.plot('Week', 'Peloton',data=data_Peloton)
plt.show()
The following is the figure output. As you can see, while the y-axis looks great, the x-axis looks very crowded. To solve that problem, we need to reduce the number of labels shown on the x-axis.
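One way to see why the labels crowd together is to check how the Week column is stored. The sketch below is only an illustration: it uses a tiny hand-typed frame (the first four rows of the printed output, under the hypothetical name sample) instead of the full CSV. Because the dates are plain text, Matplotlib treats each one as its own category and gives every row a tick.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical mini-sample: the first four rows of the printed output.
sample = pd.DataFrame({
    "Week": ["3/10/2019", "3/17/2019", "3/24/2019", "3/31/2019"],
    "Peloton": [10, 11, 10, 9],
})

# The dates are stored as text, not as numbers or datetimes.
week_is_numeric = pd.api.types.is_numeric_dtype(sample["Week"])
week_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Week"])

fig, ax = plt.subplots()
ax.plot(sample["Week"], sample["Peloton"])

# String x-values are plotted as categories: one tick per distinct string.
n_ticks = len(ax.get_xticks())
print(week_is_numeric, week_is_datetime, n_ticks)
```

With the full data set the same thing happens for all 161 rows, which is why the axis becomes unreadable.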
Add X-axis Intervals and Title
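There are quite a few ways to specify the interval. In the code below, the first line uses MultipleLocator to place a label at every 30th data point on the x-axis, and the second line adds a title with a font size of 10.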
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
Below is the complete version of the updated Python code.
import pandas as pd
import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.plot('Week', 'Peloton',data=data_Peloton)
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
plt.show()
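Another of the methods mentioned above is to turn the week strings into real dates and let Matplotlib’s date locators choose the ticks. The following is a minimal sketch, not the approach used in the rest of this tutorial. The frame demo and its Interest values are made up for illustration (only the date range matches the printed data), and the format string assumes the weeks are written as month/day/year, as they appear in the printed output.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Strings like the ones in the CSV can be parsed with an explicit format
# (assumption: month/day/year, as in the printed output).
parsed = pd.to_datetime(pd.Series(["3/10/2019", "4/7/2019"]), format="%m/%d/%Y")

# Placeholder data: 161 weekly dates covering the same range as the CSV,
# with made-up interest values (NOT the real Google Trends numbers).
weeks = pd.date_range("2019-03-10", periods=161, freq="7D")
demo = pd.DataFrame({"Week": weeks, "Interest": [10 + i % 20 for i in range(161)]})

fig, ax = plt.subplots()
ax.plot(demo["Week"], demo["Interest"])

# One tick every 6 months, labelled as year-month.
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
ax.set_title("Google Trends of Peloton", fontdict={"fontsize": 10})
# plt.show()  # uncomment when running interactively
```

The trade-off is one extra conversion step in exchange for tick positions that follow the calendar instead of the row count.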
Bar Charts
Bar charts can often do a similar job to line charts, and in many cases the two are interchangeable. However, when there are multiple variables on the Y-axis, a bar chart can get crowded, and a line chart might be the better choice. The examples below show what I mean.
Single Y Variable using Bar Charts
import pandas as pd
import matplotlib.pyplot as plt
data_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Peloton_Google_Trends.csv")
plt.bar('Week', 'Peloton',data=data_Peloton)
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(30))
plt.title("Google Trends of Peloton",fontdict = {'fontsize' : 10})
plt.show()
The following is the chart. As we can see, it is not much different from the line chart above. Note that this is a case with only one variable on the Y-axis. In some cases, however, you might have multiple variables on the Y-axis, as shown next.
Multiple Y Variables using Bar Charts
The data set below includes multiple columns of data, namely RD Expenses, Sales and Marketing, and General Admin Expenses. We can plot all 3 on the Y-axis of the same bar chart, with the Quarter column on the X-axis.
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
print(data_MSFT_T)
   Quarter  RD Expenses  Sales and Marketing  General Admin Expenses
1   2017Q1         3355                 3879                    1202
2   2017Q2         3514                 4356                    1355
3   2017Q3         3574                 3812                    1166
4   2017Q4         3504                 4562                    1109
5   2018Q1         3715                 4335                    1208
6   2018Q2         3933                 4760                    1271
7   2018Q3         3977                 4098                    1149
8   2018Q4         4070                 4588                    1132
9   2019Q1         4316                 4565                    1179
10  2019Q2         4513                 4962                    1425
11  2019Q3         4565                 4337                    1061
12  2019Q4         4603                 4933                    1121
13  2020Q1         4887                 4911                    1273
14  2020Q2         5214                 5417                    1656
15  2020Q3         4926                 4231                    1119
16  2020Q4         4899                 4947                    1139
17  2021Q1         5204                 5082                    1327
18  2021Q2         5687                 5857                    1522
19  2021Q3         5599                 4547                    1287
20  2021Q4         5758                 5379                    1384
The following is the complete Python code and figure output. When looking at the code, notice that the following line does not use plt, which comes directly from matplotlib. Instead, it uses the Pandas function pandas.DataFrame.plot, which is built on top of matplotlib. That is why you still have to include plt.show() at the end.
data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
plt.show()
We can change the font size of the x-axis label and the legend using functions from matplotlib. Finally, as with the line chart above, we can also add a title and set its font size. Below is the updated version of the complete code and figure output.
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot(x='Quarter', kind='bar', stacked=False)
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=8)
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
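Note that plt.xlabel sets the size of the axis label ("Quarters"), not of the quarter names printed under each group of bars. If those tick labels are still too large, one option is tick_params on the Axes object that DataFrame.plot returns. The following is a small sketch under that assumption. The frame msft is a hand-typed copy of the first four rows of the table printed earlier, and the sizes and rotation are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First four rows of the table printed earlier, typed in by hand.
msft = pd.DataFrame({
    "Quarter": ["2017Q1", "2017Q2", "2017Q3", "2017Q4"],
    "RD Expenses": [3355, 3514, 3574, 3504],
    "Sales and Marketing": [3879, 4356, 3812, 4562],
    "General Admin Expenses": [1202, 1355, 1166, 1109],
})

# DataFrame.plot returns the Matplotlib Axes it drew on.
ax = msft.plot(x="Quarter", kind="bar", stacked=False)

ax.set_xlabel("Quarters", fontsize=10)                    # axis label
ax.tick_params(axis="x", labelsize=8, labelrotation=45)   # tick labels
ax.legend(fontsize=8)
ax.set_title("Microsoft Operating Expenses", fontdict={"fontsize": 10})
plt.tight_layout()  # keep the rotated labels inside the figure
# plt.show()  # uncomment when running interactively
```

Working with the returned Axes does the same job as the plt calls above, and it makes it explicit which chart is being modified.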
Compare Bar Charts versus Line Charts
As mentioned, the bar chart above is a bit crowded, so we can use a line chart instead. The Pandas plot function draws a line chart by default. Thus, we can just use df.plot() to plot a line chart.
The kind of plot to produce:
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot (DataFrame only)
‘hexbin’ : hexbin plot (DataFrame only)
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T.plot()
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=10)
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
Improved Version
You might notice a problem with the X-axis: the scale does not use the Quarter column. This is because the plot function uses the row index for the X-axis by default. To fix this, we can use set_index to make the Quarter column the row index of the dataframe. Below is the improved Python code. Personally, I prefer to use line charts when there is more than one Y variable since they look less crowded.
import pandas as pd
import matplotlib.pyplot as plt
data_MSFT_T=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/data_MSFT_T.csv")
data_MSFT_T=data_MSFT_T.set_index('Quarter')
data_MSFT_T.plot()
plt.xlabel('Quarters', fontsize=10)
plt.legend(fontsize=10)
plt.title("Microsoft Operating Expenses",fontdict = {'fontsize' : 10})
plt.show()
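If you would rather leave the dataframe unchanged, passing x='Quarter' to plot, as in the bar chart earlier, should give the same line chart. The sketch below compares the two routes on a hand-typed copy of the first four rows of the table (the name msft is just for this example).

```python
import pandas as pd
import matplotlib.pyplot as plt

# First four rows of the table printed earlier, typed in by hand.
msft = pd.DataFrame({
    "Quarter": ["2017Q1", "2017Q2", "2017Q3", "2017Q4"],
    "RD Expenses": [3355, 3514, 3574, 3504],
    "Sales and Marketing": [3879, 4356, 3812, 4562],
    "General Admin Expenses": [1202, 1355, 1166, 1109],
})

# Route 1: make Quarter the row index, then plot the remaining columns.
ax_index = msft.set_index("Quarter").plot()

# Route 2: keep the frame as it is and name the x column directly.
ax_x = msft.plot(x="Quarter")

# Both charts contain one line per expense column.
line_counts = (len(ax_index.lines), len(ax_x.lines))
labels = [line.get_label() for line in ax_x.lines]
print(line_counts, labels)
# plt.show()  # uncomment when running interactively
```

Which route to use is mostly a matter of taste. set_index is convenient when the same column serves as the X-axis for several charts.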
Scatter Plot
Use Pandas for Scatter Plot
A scatter plot is used to show the relationship between two variables. For instance, if you want to understand how people’s concerns about Covid are related to their interest in buying Peloton, you can draw a scatter plot to see whether the relationship trends upward, trends downward, or is just flat. The following is the data being used, namely data from Google Trends.
import pandas as pd
import matplotlib.pyplot as plt
data_Covid_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
print(data_Covid_Peloton)
          Week  Peloton  Covid
0    2/23/2020       26      2
1     3/1/2020       24      5
2     3/8/2020       24     20
3    3/15/2020       67     53
4    3/22/2020       69     62
..         ...      ...    ...
106   3/6/2022       32     16
107  3/13/2022       29     16
108  3/20/2022       27     14
109  3/27/2022       25     14
110   4/3/2022       25     14

[111 rows x 3 columns]
data_Covid_Peloton=data_Covid_Peloton.set_index('Week')
data_Covid_Peloton.plot(kind="scatter",x="Covid",y="Peloton")
plt.xlabel('Keyword of Covid', fontsize=10)
plt.ylabel('Keyword of Peloton', fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()
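To put a number on what the scatter plot shows, you can also compute the correlation between the two columns. The sketch below uses only the ten rows visible in the printed output (the first five and the last five), typed in by hand under the hypothetical name sample, so the result is an illustration and not the value for the full data set.

```python
import pandas as pd

# The ten rows visible in the printed output (head and tail only).
sample = pd.DataFrame({
    "Peloton": [26, 24, 24, 67, 69, 32, 29, 27, 25, 25],
    "Covid": [2, 5, 20, 53, 62, 16, 16, 14, 14, 14],
})

# Pearson correlation: close to +1 means an upward trend,
# close to -1 a downward trend, and close to 0 a flat cloud of points.
r = sample["Covid"].corr(sample["Peloton"])
print(round(r, 3))
```

On the full data you would call the same method on data_Covid_Peloton instead of sample.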
Use Seaborn for Scatter Plot
Similar to Pandas, Seaborn also uses matplotlib as the underlying package. Seaborn provides a higher-level interface for drawing statistical graphics. Compared to Pandas, the nice thing about using Seaborn is that it can easily add a regression line to the plot. As you can see in the code below, you can still use plt from matplotlib to change the font size of the X-axis and Y-axis labels.
sns.lmplot(x="Covid",y="Peloton",data=data_Covid_Peloton,fit_reg=True)
import seaborn as sns
sns.lmplot(x="Covid",y="Peloton",data=data_Covid_Peloton,fit_reg=True)
plt.xlabel('Keyword of Covid', fontsize=10)
plt.ylabel('Keyword of Peloton', fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()
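As far as I understand, the line that lmplot draws by default is an ordinary least-squares fit of a straight line. The sketch below computes such a fit directly with NumPy’s polyfit, which is my own stand-in here and not something Seaborn asks you to call. It again uses only the ten rows visible in the printed output, so the slope is illustrative rather than the one shown in the figure.

```python
import numpy as np
import pandas as pd

# The ten rows visible in the printed output (head and tail only).
sample = pd.DataFrame({
    "Peloton": [26, 24, 24, 67, 69, 32, 29, 27, 25, 25],
    "Covid": [2, 5, 20, 53, 62, 16, 16, 14, 14, 14],
})

x = sample["Covid"].to_numpy(dtype=float)
y = sample["Peloton"].to_numpy(dtype=float)

# Least-squares straight line: Peloton is approximated by slope * Covid + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the vertical distances between the points and the line.
residuals = y - (slope * x + intercept)
print(round(slope, 3), round(intercept, 3))
```

A positive slope corresponds to the upward trend you would look for in the scatter plot.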
Compare Scatter Plot versus Line Charts
The example above of Covid and Peloton involves time points, and thus we can also use line charts to show the relationship. By comparing scatter plots and line charts, you will have a better idea of the difference and connection between these two.
import pandas as pd
import matplotlib.pyplot as plt
data_Covid_Peloton=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
data_Covid_Peloton=data_Covid_Peloton.set_index('Week')
data_Covid_Peloton.plot()
plt.legend(fontsize=10)
plt.xlabel('Week',fontsize=10)
plt.title("Relationship of Covid and Peloton based on Google Trends",fontdict = {'fontsize' : 10})
plt.show()