How to use scikit-learn datasets in data science projects

To learn data science techniques you’ll need the right kind of datasets. Thankfully, many are easy to access from within the scikit-learn package.


The scikit-learn package comes with a range of small built-in "toy" datasets that are ideal for use in test projects and applications. As they're part of the scikit-learn package, you don't even need to download them separately. You can simply tell scikit-learn to load them and you have instant access.

The toy datasets cover a wide range of types of data, so they're ideal for quick tests in Jupyter notebooks, whether you're learning Pandas or NumPy, or practising new approaches for supervised or unsupervised learning models. As these datasets are widely available and consistent, they're often used in machine learning research papers, so you'll likely see them in lots of examples.

How to access scikit-learn datasets

To use these datasets you'll first need to install scikit-learn by typing pip3 install scikit-learn into your terminal, if you've not already installed it.

Once that's installed, you then need to import the desired dataset loader from the sklearn.datasets package. There are seven to choose from: load_boston, load_iris, load_diabetes, load_digits, load_linnerud, load_wine and load_breast_cancer. Once you've imported the loader, you simply call it and assign the returned object to a variable, e.g. data = load_boston().

The returned object behaves like a Python dictionary, so you can easily access its contents and load the data of interest into a Pandas DataFrame, which we've called df in the examples below. The pd.DataFrame() function is passed two arguments: one containing the data from the scikit-learn object and one containing the feature_names, which Pandas uses to name the columns.

Boston house prices dataset (regression)

The Boston house prices dataset has been used in many machine learning papers that look at linear regression problems, as it's possible to use the features in the dataset to predict the values of the houses. (Note that load_boston was deprecated and has been removed from scikit-learn 1.2 onwards due to ethical concerns with the dataset, so the example below only works on older versions.)

import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
CRIM ZN INDUS CHAS ... TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 ... 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 ... 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 ... 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 ... 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 ... 222.0 18.7 396.90 5.33

5 rows × 13 columns

Iris plants dataset (classification)

The Iris dataset is one of the world's best known and most widely used. It's a small taxonomic dataset with 150 rows of data on three different Iris species. One of the species is linearly separable from the other two, while the remaining two are not linearly separable from each other, so it's great for simple classification projects.

import pandas as pd
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
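Since the target is returned as integer class codes, it can help readability to map them to the species names held in target_names. A quick sketch of one way to do this, using NumPy fancy indexing:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# data.target holds integer class codes (0, 1, 2); indexing
# data.target_names with them yields the species name for each row
df['species'] = data.target_names[data.target]
print(df['species'].unique())  # ['setosa' 'versicolor' 'virginica']
```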

Diabetes dataset (regression)

The diabetes dataset includes 10 baseline variables, like age, sex, BMI, average blood pressure and some blood serum measurements, plus a response column which shows the progression of the disease in the patient. The data in this dataset have been mean centred and scaled by the standard deviation times the square root of the number of samples, so that the sum of squares of each column totals 1, which is often a useful approach when preparing data for modelling.

import pandas as pd
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
age sex bmi bp ... s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 ... -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 ... 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 ... -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 ... -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 ... 0.008142 -0.002592 -0.031991 -0.046641

5 rows × 10 columns
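You can verify the scaling described above directly: each feature column should have a mean of roughly zero and a sum of squares of roughly one. A quick check:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Each feature column has been mean centred...
print(np.allclose(df.mean(), 0))        # True
# ...and scaled so its sum of squares totals 1
print(np.allclose((df ** 2).sum(), 1))  # True
```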

Optical recognition of handwritten digits dataset (classification)

The optical recognition of handwritten digits dataset is a little different, as it contains pixel data from scans of handwritten digits collected from 43 different people. Each 8×8 image is flattened into a row of 64 pixel values, from which you can predict the digits depicted, with the right approach.

import pandas as pd
from sklearn.datasets import load_digits
data = load_digits()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 ... pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7
0 0.0 0.0 5.0 13.0 ... 10.0 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 ... 16.0 10.0 0.0 0.0
2 0.0 0.0 0.0 4.0 ... 11.0 16.0 9.0 0.0
3 0.0 0.0 7.0 15.0 ... 13.0 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 ... 16.0 4.0 0.0 0.0

5 rows × 64 columns
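As well as the flattened 64-pixel rows in data, the object returned by load_digits also has an images key holding the original 8×8 grids, which is handy if you want to plot the digits. A short sketch:

```python
from sklearn.datasets import load_digits

data = load_digits()

# data.data holds each image flattened into a row of 64 pixel values;
# data.images keeps the original 2D 8x8 grids
print(data.data.shape)    # (1797, 64)
print(data.images.shape)  # (1797, 8, 8)
print(data.target[0])     # 0 - the digit the first image depicts
```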

Linnerud dataset (multi-output regression)

The Linnerud dataset includes data on exercises undertaken by 20 middle-aged men in a fitness club, alongside three physiological measurements, and can be used for multi-output regression projects.

import pandas as pd
from sklearn.datasets import load_linnerud
data = load_linnerud()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
Chins Situps Jumps
0 5.0 162.0 60.0
1 2.0 110.0 60.0
2 12.0 101.0 101.0
3 12.0 105.0 37.0
4 13.0 155.0 58.0

Wine recognition dataset (classification)

The Wine recognition dataset contains the results of a chemical analysis of wines grown by three different cultivators in the same region of Italy, making it an interesting and simple-to-use dataset for classification problems.

import pandas as pd
from sklearn.datasets import load_wine
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
alcohol malic_acid ash alcalinity_of_ash ... color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 ... 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 ... 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 ... 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 ... 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 ... 4.32 1.04 2.93 735.0

5 rows × 13 columns

Wisconsin breast cancer diagnostic dataset (classification)

Finally, the Wisconsin breast cancer diagnostic dataset contains features computed from digitised images of fine needle aspirates of breast masses, which can be used to classify whether a tumour is malignant or benign.

import pandas as pd
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
mean radius mean texture mean perimeter mean area ... worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 ... 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 ... 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 ... 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 ... 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 ... 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

Extracting other information from the scikit-learn objects

As you may have noticed in the example data above, these DataFrames only contain some of the data you'll need to build your model, because we specifically selected the data key from the object. However, the scikit-learn Bunch object that each function returns contains some additional data you can access too. As the Bunch behaves like a Python dictionary, you can access each component by its key, and you can loop over the whole object to view its entire contents.

import pandas as pd
from sklearn.datasets import load_wine
data = load_wine()

for key, value in data.items():
    print(key, '\n', value, '\n')
data 
 [[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]] 

target 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] 

frame 
 None 

target_names 
 ['class_0' 'class_1' 'class_2'] 

DESCR 
 .. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).
 

feature_names 
 ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'] 

Depending on the dataset, you'll find various values in here, including: data which holds the raw data, target which holds the target variable for modelling, target_names which holds the names of the target classes (or the names of the target columns for multi-output datasets), DESCR which contains a detailed description of the data, and feature_names which includes the names of the features that you can use in your Pandas DataFrame. You can view any of these by typing print(data['target_names']), where target_names is the key you want to view.
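Because the feature DataFrame built from the data key doesn't include the labels, a common pattern is to append the target as an extra column so features and labels sit in one place. A sketch using the wine dataset:

```python
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Append the target class codes so features and labels
# live in a single DataFrame
df['target'] = data.target
print(df['target'].value_counts())
```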

Returning X and y for use in your models

Another neat trick you can do with these datasets is pass the return_X_y=True argument to the load function to tell it to return the data and the target separately, assigning them to X and y for direct use in your model. By also passing the argument as_frame=True you can return the X data as a Pandas DataFrame. As there's usually only a single target column, y will generally be returned as a Series, but a multi-output dataset like load_linnerud will return y as a DataFrame too.

import pandas as pd
from sklearn.datasets import load_linnerud
X, y = load_linnerud(return_X_y=True, as_frame=True)
X.head()
Chins Situps Jumps
0 5.0 162.0 60.0
1 2.0 110.0 60.0
2 12.0 101.0 101.0
3 12.0 105.0 37.0
4 13.0 155.0 58.0
y.head()
Weight Waist Pulse
0 191.0 36.0 50.0
1 189.0 37.0 52.0
2 193.0 38.0 58.0
3 162.0 35.0 62.0
4 189.0 35.0 46.0
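With X and y returned directly like this, you can drop them straight into a model. A minimal sketch using the wine dataset with a train/test split and logistic regression (the model and parameter choices here are just an illustration, not part of the original example):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Return the features and target directly, ready for modelling
X, y = load_wine(return_X_y=True, as_frame=True)

# Hold back 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_iter is raised because the features aren't standardised
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```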

Matt Clarke, Monday, March 01, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
