When training a machine learning model, you will typically split your dataset into two, with one portion of the data used to train the model, and the other portion (usually 20-30%) used as a test or holdout group to validate the model’s performance.
While this approach works, the model can be thrown off course if certain values appear in either group in different proportions from those the model will encounter in real world data.
Making the test or validation dataset larger can make the model more reliable, but it means there’s less data to use for training, which could mean the model performs worse. One solution to this problem is a machine learning technique called cross validation.
Cross validation is a machine learning technique whereby the data are divided into equal groups called “folds” and the training process is run a number of times, each time using a different portion of the data, or “fold”, for validation.
For example, let’s say you created five folds. This would divide your data into five equal portions or folds. In the first experiment you’d train your model on data from folds 2-5 and use fold 1 for validation, then record your evaluation metric.
On the second experiment, you’d train your model on folds 1 and 3-5, and use fold 2 for validation, then record your evaluation metric. The process is repeated for each fold until you’ve trained your model and used each fold as a holdout at some point.
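The fold rotation described above can be sketched with scikit-learn’s KFold class. This is a minimal illustration on a tiny made-up array, just to show how each fold takes a turn as the holdout set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, so each of the five folds holds two of them.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # In each pass, one fold is held out for validation and the
    # remaining four folds are used for training.
    print(f"Fold {i}: train on {train_idx}, validate on {val_idx}")
```

Each row of output shows a different fold acting as the holdout, so every sample is used for validation exactly once across the five runs.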
Cross validation gives you a better idea of the performance of your model because it uses all the data to validate performance during the training process. The downside is that it can be a bit slower, as you’re running the process typically five times.
That said, it’s very useful, especially during the model selection process. During model selection, you’d run k-fold cross validation for each model, then calculate the performance of each model, and then decide which model performed best.
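As a sketch of that model selection workflow, the snippet below compares two candidate regressors using five-fold cross validation. The candidate models and the synthetic data from make_regression are illustrative assumptions chosen so the example runs quickly and is self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression dataset (an assumption, for speed).
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)

# Candidate models to compare; in practice you'd try several more.
candidates = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=10, random_state=1),
}

for name, model in candidates.items():
    # Five-fold cross validation; scores are negated MAE, so flip the sign.
    scores = -1 * cross_val_score(model, X, y, cv=5,
                                  scoring='neg_mean_absolute_error')
    print(f"{name}: mean MAE {scores.mean():.3f}")
```

The model with the lowest mean MAE across the folds would be the one to take forward.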
Model selection is sometimes overlooked, but it’s arguably the single biggest way to improve model performance. Some models perform much better than others on certain problems, and it pays to experiment a little to find the best one for the job.
In this project, we’ll use scikit-learn in Python to build a simple Random Forest model using RandomForestRegressor and then apply cross validation to better understand the model’s performance.
To keep things simple we’ll use one of the built-in scikit-learn datasets called the California Housing dataset. This is ideal for regression problems and doesn’t require any cleaning or processing, so we can focus on the main topic. We’ll be using RandomForestRegressor to create a regression model, the train_test_split function to split the data, cross_val_score to calculate the cross validation score for each fold, and mean_absolute_error to evaluate performance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import fetch_california_housing
We’ll start by loading the California Housing data using the fetch_california_housing() function. Passing the return_X_y=True and as_frame=True arguments returns the X data as a Pandas dataframe and the y data as a Pandas series.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
Next, we’ll create a regression model without using cross validation. We’ll first split the data up into the train and test groups using train_test_split() and will allocate 30% to the test or validation group. We’ll then fit and train a basic RandomForestRegressor model, then generate predictions using predict() and calculate the mean absolute error or MAE.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mae
To create a Random Forest model with cross validation it’s generally easiest to use a scikit-learn model pipeline. Ours is a very basic one, since our data doesn’t require preprocessing, but you can easily slot in additional steps to encode variables or scale data, making this a cleaner and more efficient way to write your model code. Our very simple pipeline has a single step that defines our model in the same way as we did above.
model_pipeline = Pipeline(steps=[('model', RandomForestRegressor(n_estimators=100, random_state=1))])
Next, we’ll use cross_val_score() and will pass it the model_pipeline and the original X and y data, instead of the data we split using train_test_split. We’ll set the cv value to 5, which defines that we’ll use five folds, and we’ll set the scoring argument to neg_mean_absolute_error. This metric is a bit weird, because it returns a negative mean absolute error, unlike the mean_absolute_error() function, which returns a positive value. Therefore, to make these comparable, we’ll multiply each score by -1 to turn the negative into a positive.
scores = -1 * cross_val_score(model_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
If you run the code, after a few minutes of crunching, the model will spit out a list of MAE values, one for each fold. As you can see from ours, we get slightly different scores for each fold tested. You can use mean() to return the mean score across all the folds, which is the usual way to compare model performance during model selection.
array([0.54535132, 0.40632189, 0.4386851 , 0.46471479, 0.47538634])
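Continuing from the scores above, a short sketch of summarising the per-fold results (the standard deviation is an extra we've added here as a quick way to gauge how stable the scores are across folds):

```python
import numpy as np

# The per-fold MAE values reported above.
scores = np.array([0.54535132, 0.40632189, 0.4386851,
                   0.46471479, 0.47538634])

# The mean across folds is the single number usually compared
# during model selection; the standard deviation shows stability.
print(round(scores.mean(), 4))  # 0.4661
print(round(scores.std(), 4))
```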
Matt Clarke, Saturday, May 07, 2022