When training a machine learning model, you will typically split your dataset into two, with one portion of the data used to train the model, and the other portion (usually 20-30%) used as a test or holdout group to validate the model’s performance.
While this approach works, the model can be thrown off course if certain values appear in either group in different proportions from those the model will encounter in real world data.
Making the test or validation dataset larger can make the model more reliable, but it means there’s less data to use for training, which could mean the model performs worse. One solution to this problem is a machine learning technique called cross validation.
Cross validation is a machine learning technique whereby the data are divided into equal groups called “folds” and the training process is run a number of times, each time using a different portion of the data, or “fold”, for validation.
For example, let’s say you created five folds. This would divide your data into five equal portions or folds. In the first experiment you’d train your model on data from folds 2-5 and use fold 1 for validation, then record your evaluation metric.
On the second experiment, you’d train your model on folds 1 and 3-5, and use fold 2 for validation, then record your evaluation metric. The process is repeated for each fold until you’ve trained your model and used each fold as a holdout at some point.
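The fold rotation described above can be sketched with scikit-learn’s KFold class. This is a minimal illustration on a tiny made-up array, just to show how each fold takes a turn as the holdout set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, so each of the five folds holds two of them.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # In each pass, one fold is held out for validation and the
    # remaining four folds are used for training.
    print(f"Fold {i}: train on {train_idx}, validate on {val_idx}")
```

Each row of output shows a different fold acting as the holdout, so every sample is used for validation exactly once across the five runs.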
Cross validation gives you a better idea of the performance of your model because it uses all the data to validate performance during the training process. The downside is that it can be a bit slower, as you’re running the process typically five times.
That said, it’s very useful, especially during the model selection process. During model selection, you’d run k-fold cross validation for each model, then calculate the performance of each model, and then decide which model performed best.
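As a sketch of that model selection workflow, the snippet below compares two candidate regressors using five-fold cross validation. The candidate models and the synthetic data from make_regression are illustrative assumptions chosen so the example runs quickly and is self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression dataset (an assumption, for speed).
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)

# Candidate models to compare; in practice you'd try several more.
candidates = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=10, random_state=1),
}

for name, model in candidates.items():
    # Five-fold cross validation; scores are negated MAE, so flip the sign.
    scores = -1 * cross_val_score(model, X, y, cv=5,
                                  scoring='neg_mean_absolute_error')
    print(f"{name}: mean MAE {scores.mean():.3f}")
```

The model with the lowest mean MAE across the folds would be the one to take forward.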
Model selection is sometimes overlooked, but it’s arguably the single biggest way to improve model performance. Some models perform much better than others on certain problems, and it pays to experiment a little to find the best one for the job.
In this project, we’ll use scikit-learn in Python to build a simple Random Forest model using RandomForestRegressor and then apply cross validation to better understand the model’s performance.
To keep things simple we’ll use one of the built-in scikit-learn datasets called the California Housing dataset. This is ideal for regression problems and doesn’t require any cleaning or processing, so we can focus on the main topic. We’ll be using RandomForestRegressor to create a regression model, the train_test_split function to split the data, cross_val_score to calculate the cross validation score for each fold, and mean_absolute_error to evaluate performance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import fetch_california_housing
We’ll start by loading the California Housing data using the fetch_california_housing() function. Passing the return_X_y=True and as_frame=True arguments returns the X data as a Pandas dataframe and the y data as a Pandas series.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
Next, we’ll create a regression model without using cross validation. We’ll first split the data up into the train and test groups using train_test_split() and will allocate 30% to the test or validation group. We’ll then fit and train a basic RandomForestRegressor model, then generate predictions using predict() and calculate the mean absolute error or MAE.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mae
To create a Random Forest model with cross validation it’s generally easiest to use a scikit-learn model pipeline. Ours is a very basic one, since our data doesn’t require preprocessing, but you can easily slot in additional steps to encode variables or scale data, making this a cleaner and more efficient way to write your model code. Our very simple pipeline has a single step that defines our model in the same way as we did above.
model_pipeline = Pipeline(steps=[('model', RandomForestRegressor(n_estimators=100, random_state=1))])
Next, we’ll use cross_val_score() and will pass it the model_pipeline and the original X and y data, instead of the data we split using train_test_split. We’ll set the cv value to 5, which defines that we’ll use five folds, and we’ll set the scoring argument to neg_mean_absolute_error. This metric is a bit weird, because it returns a negative mean absolute error, unlike the mean_absolute_error() function, which returns a positive value. Therefore, to make these comparable, we’ll multiply each score by -1 to turn the negative into a positive.
scores = -1 * cross_val_score(model_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
If you run the code, after a few minutes of crunching, the model will spit out a list of MAE values, one for each fold. As you can see from ours, we get slightly different scores for each fold tested. You can use mean() to return the mean score across all the folds, which is the usual way to compare model performance during model selection.
array([0.54535132, 0.40632189, 0.4386851 , 0.46471479, 0.47538634])
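Continuing from the scores above, a short sketch of summarising the per-fold results (the standard deviation is an extra we've added here as a quick way to gauge how stable the scores are across folds):

```python
import numpy as np

# The per-fold MAE values reported above.
scores = np.array([0.54535132, 0.40632189, 0.4386851,
                   0.46471479, 0.47538634])

# The mean across folds is the single number usually compared
# during model selection; the standard deviation shows stability.
print(round(scores.mean(), 4))  # 0.4661
print(round(scores.std(), 4))
```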
Matt Clarke, Saturday, May 07, 2022