When training a machine learning model, you will typically split your dataset into two, with one portion of the data used to train the model, and the other portion (usually 20-30%) used as a test or holdout group to validate the model’s performance.
While this approach works, the model can be thrown off course if certain values appear in either group in different proportions from those the model will encounter in real world data.
Making the test or validation dataset larger can make the model more reliable, but it means there’s less data to use for training, which could mean the model performs worse. One solution to this problem is a machine learning technique called cross validation.
Cross validation is a machine learning technique whereby the data are divided into equal groups called “folds” and the training process is run a number of times, each time using a different portion of the data, or “fold”, for validation.
For example, let’s say you created five folds. This would divide your data into five equal portions or folds. In the first experiment you’d train your model on data from folds 2-5 and use fold 1 for validation, then record your evaluation metric.
On the second experiment, you’d train your model on folds 1 and 3-5, and use fold 2 for validation, then record your evaluation metric. The process is repeated for each fold until you’ve trained your model and used each fold as a holdout at some point.
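The fold rotation described above can be sketched with scikit-learn’s KFold class. This is a minimal illustration on a tiny made-up array, just to show how each fold takes a turn as the holdout set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, so each of the five folds holds two of them.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # In each pass, one fold is held out for validation and the
    # remaining four folds are used for training.
    print(f"Fold {i}: train on {train_idx}, validate on {val_idx}")
```

Each row of output shows a different fold acting as the holdout, so every sample is used for validation exactly once across the five runs.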
Cross validation gives you a better idea of the performance of your model because it uses all the data to validate performance during the training process. The downside is that it can be a bit slower, as you’re running the process typically five times.
That said, it’s very useful, especially during the model selection process. During model selection, you’d run k-fold cross validation for each model, then calculate the performance of each model, and then decide which model performed best.
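As a sketch of that model selection workflow, the snippet below compares two candidate regressors using five-fold cross validation. The candidate models and the synthetic data from make_regression are illustrative assumptions chosen so the example runs quickly and is self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression dataset (an assumption, for speed).
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)

# Candidate models to compare; in practice you'd try several more.
candidates = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=10, random_state=1),
}

for name, model in candidates.items():
    # Five-fold cross validation; scores are negated MAE, so flip the sign.
    scores = -1 * cross_val_score(model, X, y, cv=5,
                                  scoring='neg_mean_absolute_error')
    print(f"{name}: mean MAE {scores.mean():.3f}")
```

The model with the lowest mean MAE across the folds would be the one to take forward.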
Model selection is sometimes overlooked, but it’s arguably the single biggest way to improve model performance. Some models perform much better than others on certain problems, and it pays to experiment a little to find the best one for the job.
In this project, we’ll use scikit-learn in Python to build a simple Random Forest model using RandomForestRegressor and then apply cross validation to better understand the model’s performance.
To keep things simple we’ll use one of the built-in scikit-learn datasets called the California Housing dataset. This is ideal for regression problems and doesn’t require any cleaning or processing, so we can focus on the main topic. We’ll be using RandomForestRegressor to create a regression model, the train_test_split function to split the data, cross_val_score to calculate the cross validation score for each fold, and mean_absolute_error to evaluate performance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import fetch_california_housing
We’ll start by loading the California Housing data using the fetch_california_housing() function. Passing the return_X_y=True and as_frame=True arguments returns the X data as a Pandas dataframe and the y data as a Pandas series.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
Next, we’ll create a regression model without using cross validation. We’ll first split the data up into the train and test groups using train_test_split() and will allocate 30% to the test or validation group. We’ll then fit and train a basic RandomForestRegressor model, then generate predictions using predict() and calculate the mean absolute error or MAE.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mae
To create a Random Forest model with cross validation it’s generally easiest to use a scikit-learn model pipeline. Ours is a very basic one, since our data doesn’t require preprocessing, but you can easily slot in additional steps to encode variables or scale data, making this a cleaner and more efficient way to write your model code. Our very simple pipeline has a single step that defines our model in the same way as we did above.
model_pipeline = Pipeline(steps=[('model', RandomForestRegressor(n_estimators=100, random_state=1))])
Next, we’ll use cross_val_score() and will pass it the model_pipeline and the original X and y data, instead of the data we split using train_test_split. We’ll set the cv value to 5, which defines that we’ll use five folds, and we’ll set the scoring argument to neg_mean_absolute_error. This metric is a bit weird, because it returns a negative mean absolute error, unlike the mean_absolute_error() function, which returns a positive value. Therefore, to make these comparable, we’ll multiply each score by -1 to turn the negative into a positive.
scores = -1 * cross_val_score(model_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
If you run the code, after a few minutes of crunching, the model will spit out a list of MAE values, one for each fold. As you can see from ours, we get slightly different scores for each fold tested. You can use mean() to return the mean score across all the folds, which is the usual way to compare model performance during model selection.
array([0.54535132, 0.40632189, 0.4386851 , 0.46471479, 0.47538634])
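Continuing from the scores above, a short sketch of summarising the per-fold results (the standard deviation is an extra we've added here as a quick way to gauge how stable the scores are across folds):

```python
import numpy as np

# The per-fold MAE values reported above.
scores = np.array([0.54535132, 0.40632189, 0.4386851,
                   0.46471479, 0.47538634])

# The mean across folds is the single number usually compared
# during model selection; the standard deviation shows stability.
print(round(scores.mean(), 4))  # 0.4661
print(round(scores.std(), 4))
```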
Matt Clarke, Saturday, May 07, 2022