CatBoost is a gradient boosting library based on decision trees, much like XGBoost, LightGBM, and other tree-based models. It is a very popular choice for tabular data and is often used in Kaggle competitions. It is also fast at prediction time, so it can be used for real-time predictions.
In this tutorial I’ll provide example code so you can train a CatBoostClassifier model and then use Optuna to optimize its hyperparameters. Optuna is a powerful package for hyperparameter optimization. It is very easy to use and, because it uses the results of earlier trials to choose what to try next, it can often find good settings in fewer trials than GridSearchCV or RandomizedSearchCV.
!pip3 install catboost
!pip3 install optuna
For this tutorial we’ll be using the CatBoostClassifier model from CatBoost, the Optuna package for hyperparameter optimization, and Python’s pickle module to save our trained model. To evaluate the performance of our classifier we’ll use the accuracy_score and classification_report functions from scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
import optuna
from optuna.samplers import TPESampler
import catboost
import pickle
To keep things simple and allow us to focus on the task of training and tuning the CatBoost classifier, we’ll use the wine dataset from sklearn. This dataset contains 13 features and 3 classes. The goal is to predict the class of a wine based on its features. We’ll use the load_wine() function to load the data and will get this to return a Pandas dataframe.
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 13.76 | 1.53 | 2.70 | 19.5 | 132.0 | 2.95 | 2.74 | 0.50 | 1.35 | 5.40 | 1.25 | 3.00 | 1235.0 |
| 73 | 12.99 | 1.67 | 2.60 | 30.0 | 139.0 | 3.30 | 2.89 | 0.21 | 1.96 | 3.35 | 1.31 | 3.50 | 985.0 |
| 29 | 14.02 | 1.68 | 2.21 | 16.0 | 96.0 | 2.65 | 2.33 | 0.26 | 1.98 | 4.70 | 1.04 | 3.59 | 1035.0 |
| 48 | 14.10 | 2.02 | 2.40 | 18.8 | 103.0 | 2.75 | 2.92 | 0.32 | 2.38 | 6.20 | 1.07 | 2.75 | 1060.0 |
| 166 | 13.45 | 3.70 | 2.60 | 23.0 | 111.0 | 1.70 | 0.92 | 0.43 | 1.46 | 10.68 | 0.85 | 1.56 | 695.0 |
If you call the Pandas value_counts() method on the target variable y, you’ll see that this dataset has three classes. These are not perfectly balanced, but the imbalance is mild and won’t be a massive problem for CatBoost.
y.value_counts()
1 71
0 59
2 48
Name: target, dtype: int64
Next we’ll split the data into training and test sets. We’ll use 70% of the data for training and 30% for testing by setting the test_size parameter to 0.3. The random_state parameter is set to 1 to ensure reproducibility of the results. If you omit it, you could get a different split each time you run the function.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
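Because the three classes are not the same size, one optional variation is to stratify the split so each class keeps roughly the same share in the training and test sets. The sketch below is an assumption rather than what this tutorial uses: it adds the stratify argument of train_test_split and compares class proportions. The names ending in _s are hypothetical and chosen so they do not overwrite the variables above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# Same split sizes and seed as above, but stratified on the target (assumption).
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Share of each class in the full dataset and in the stratified test set.
overall = y.value_counts(normalize=True).sort_index()
in_test = y_test_s.value_counts(normalize=True).sort_index()
print(overall.round(3).to_dict())
print(in_test.round(3).to_dict())
```

The rest of the tutorial continues with the unstratified split, so the results shown later would not be reproduced exactly if you switched to this one.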
Now we have our dataset sorted, we can create and train a CatBoostClassifier model. This will be a simple base model with no hyperparameter tuning. We’ll define the model, then fit it to the training data. It should train quickly as this dataset is very small. Once that’s done, we can generate some predictions from the test data.
model = catboost.CatBoostClassifier(verbose=False)
model.fit(X_train, y_train)
<catboost.core.CatBoostClassifier at 0x7f4bbbab73d0>
y_pred = model.predict(X_test)
There are a couple of scikit-learn functions we can use to evaluate the model. The first is the accuracy_score function, which returns the accuracy of the model. The second is the classification_report function, which returns a report with the precision, recall, and F1 score for each class. As you can see, the base CatBoostClassifier is actually pretty decent even before hyperparameter tuning.
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
           0       0.96      1.00      0.98        23
           1       1.00      0.95      0.97        19
           2       1.00      1.00      1.00        12
    accuracy                           0.98        54
   macro avg       0.99      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54
print(accuracy_score(y_test, y_pred))
0.9814814814814815
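To make the report easier to read, here is a small worked check of where those figures come from. The counts are inferred from the report itself rather than taken from a confusion matrix: with 54 test wines and an accuracy of 0.98, exactly one wine is misclassified, and the class 1 recall and class 0 precision show that it is a class 1 wine predicted as class 0.

```python
# Counts inferred from the classification report above (an assumption, not model output).
support = {0: 23, 1: 19, 2: 12}          # true wines per class in the test set
true_positives = {0: 23, 1: 18, 2: 12}   # correctly predicted wines per class
predicted = {0: 24, 1: 18, 2: 12}        # wines the model assigned to each class

total = sum(support.values())
accuracy = sum(true_positives.values()) / total   # 53 correct out of 54
recall_1 = true_positives[1] / support[1]         # 18 of 19 class 1 wines found
precision_0 = true_positives[0] / predicted[0]    # 23 of 24 class 0 predictions right

print(round(accuracy, 4), round(recall_1, 2), round(precision_0, 2))
```

On a test set this small, each wine is worth about 1.9 percentage points of accuracy, so differences between models on this split come down to one or two predictions.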
To try to eke extra performance out of our model and improve its accuracy we’ll now use the Optuna hyperparameter tuning library to find the best hyperparameters for our model. To get started, the first thing we need to do is create a custom objective function designed specifically for our CatBoostClassifier model.
This function takes an Optuna trial object, uses it to suggest a value for each hyperparameter we want to tune, and returns the accuracy of the model trained with those values. We’ll then use Optuna to find the best hyperparameters for our model by running this function many times with different hyperparameter values.
def objective(trial):
    model = catboost.CatBoostClassifier(
        iterations=trial.suggest_int("iterations", 100, 1000),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        depth=trial.suggest_int("depth", 4, 10),
        l2_leaf_reg=trial.suggest_float("l2_leaf_reg", 1e-8, 100.0, log=True),
        bootstrap_type=trial.suggest_categorical("bootstrap_type", ["Bayesian"]),
        random_strength=trial.suggest_float("random_strength", 1e-8, 10.0, log=True),
        bagging_temperature=trial.suggest_float("bagging_temperature", 0.0, 10.0),
        od_type=trial.suggest_categorical("od_type", ["IncToDec", "Iter"]),
        od_wait=trial.suggest_int("od_wait", 10, 50),
        verbose=False
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
Next we need to create an Optuna study using our objective function. We’ll use the TPE sampler, which is a good default for most problems. This uses the Tree-structured Parzen Estimator to sample the hyperparameter space. We’ll also set the direction to maximize, since we want to maximise the accuracy score. We’ll set it to run through 100 different trials. To avoid getting a message every time a trial runs, I’ve turned off verbose mode in Optuna by manually overriding the verbosity of the logging.
optuna.logging.set_verbosity(optuna.logging.WARNING)
sampler = TPESampler(seed=1)
study = optuna.create_study(study_name="catboost", direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
After a couple of minutes, depending on the speed of your workstation, Optuna should have crunched through the trials and tried the hyperparameters that you specified. We can access the data from the study to find out which hyperparameters performed best.
print("Number of finished trials: ", len(study.trials))
print("Best trial:")
trial = study.best_trial
print(" Value: ", trial.value)
print(" Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
Number of finished trials: 100
Best trial:
Value: 1.0
Params:
iterations: 503
learning_rate: 0.06564339077069614
depth: 6
l2_leaf_reg: 7.546635702360232e-06
bootstrap_type: Bayesian
random_strength: 1.4799844388224288e-07
bagging_temperature: 0.19366957870297075
od_type: IncToDec
od_wait: 20
Now that Optuna has identified the best-scoring combination of hyperparameters for our CatBoostClassifier, we can create a new model with these hyperparameters and train it on the training set. We can pass in **trial.params to the model to unpack the hyperparameters that Optuna identified as being the best.
model = catboost.CatBoostClassifier(**trial.params, verbose=False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Finally, we can evaluate the model on the test set and see how well it performs. The base model was already pretty solid, and the tuned model now hits 100% accuracy on the test set, which on 54 test wines means one more correct prediction than before. Bear in mind that Optuna chose these hyperparameters by scoring each trial on this same test set, so the figure is an optimistic estimate rather than a guarantee of how the model will perform on new data.
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        23
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        12
    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54
print(accuracy_score(y_test, y_pred))
1.0
Now that we have a model tuned to our specific dataset, we can save it for future use. We’ll use Pickle to save the ML model to disk. This will allow us to load the model at a later date and use it to make predictions on new data without the hassle of retraining or reoptimising it.
with open("catboost_model.pkl", "wb") as f:
    pickle.dump(model, f)
Matt Clarke, Friday, October 14, 2022