Built-in Sample Datasets in Python

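Several Python libraries ship with sample datasets that you can use for practice, so you do not need to prepare data files of your own. Note that seaborn and statsmodels download the data when you load it, so they need an internet connection, whereas scikit-learn bundles the iris data with the package. The following is a list of three such sample datasets.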

1. penguins in seaborn

The penguins dataset was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. For more information about this dataset, you can refer to this post.

# Built-in sample dataset of penguins in seaborn
import seaborn as sns

# load the penguins dataset from seaborn
penguins = sns.load_dataset("penguins")

# print the penguins dataset
print(penguins)

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    Male  
1         3800.0  Female  
2         3250.0  Female  
3            NaN     NaN  
4         3450.0  Female  
..           ...     ...  
339          NaN     NaN  
340       4850.0  Female  
341       5750.0    Male  
342       5200.0  Female  
343       5400.0    Male  

[344 rows x 7 columns]
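
The output above contains rows with missing values (NaN), for example row 3. The following is a minimal sketch of how such rows could be counted and dropped with pandas. To keep it runnable without downloading anything, it rebuilds only the first five rows shown in the output; the names `sample`, `missing_per_column`, and `complete_rows` are illustrative, and the same calls should work on the full `penguins` dataframe.

```python
import numpy as np
import pandas as pd

# First five rows, copied from the printed penguins output
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["Male", "Female", "Female", np.nan, "Female"],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()

# Keep only the rows that have no missing values
complete_rows = sample.dropna()

print(missing_per_column)
print(len(complete_rows))
```

Whether dropping incomplete rows is appropriate depends on the analysis; for quick practice it is usually the simplest option.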

2. iris in statsmodels

The Iris flower dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems".

# Built-in sample dataset of iris in statsmodels
import statsmodels.api as sm

# load the iris dataset from statsmodels
iris = sm.datasets.get_rdataset('iris').data

# print the iris dataset
print(iris)

     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]

3. iris in sklearn

# Built-in sample dataset of iris in sklearn
from sklearn.datasets import load_iris

# load the iris dataset 
iris = load_iris()

# print the feature array of the iris dataset
print(iris.data)

The following is a partial output, as the array is quite long. (As shown in the previous section, the iris dataset has 150 rows.)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
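
Unlike the statsmodels version, `load_iris()` returns a dictionary-like object rather than a dataframe, with the measurements in `iris.data` and the species codes in `iris.target`. The following is a minimal sketch of converting it to a dataframe, assuming pandas is installed; the name `iris_df` and the `species` column are illustrative choices, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a dataframe from the feature array, using the bundled feature names as column labels
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds integer codes (0, 1, 2); index target_names to get readable labels
iris_df["species"] = iris.target_names[iris.target]

print(iris_df.shape)
print(iris_df["species"].unique())
```

The resulting dataframe has the same 150 rows as the statsmodels version, though the column names differ.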
