The LightGBM model is a gradient boosting framework that uses tree-based learning algorithms, much like the popular XGBoost model. LightGBM supports both classification and regression tasks, and is known for its high speed and accuracy. LightGBM was originally developed by Microsoft and is now an open source project. It is often used in machine learning competitions, and is a popular choice for Kaggle users.
LightGBM has several advantages over other gradient boosting frameworks. It’s fast, scalable, and typically uses less memory than XGBoost. Its accuracy is often on a par with, or better than, that of other frameworks, and it copes well with large datasets. Like XGBoost, it also supports parallel and GPU learning, which can make training very fast if you’ve got a powerful GPU.
In this post, we will use the LightGBM model to create a classification model and tune its hyperparameters using Optuna.
To get started, open a Jupyter notebook and install the LightGBM and Optuna packages from the Pip package management system. You can do this from within the notebook by putting an exclamation mark before the pip3 install command and then executing the code cell.
!pip3 install optuna
!pip3 install lightgbm
Next we’ll load the packages we need for this project. We’ll be using LightGBM for our model and Optuna for hyperparameter tuning. We’ll need the train_test_split function from scikit-learn to split our data into training and test sets, and the accuracy_score and classification_report functions to evaluate our model. We’ll save our trained model using Pickle.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
import optuna
from optuna.samplers import TPESampler
import pickle
Next, we’ll load our dataset. To keep things simple we’ll use the wine dataset built into scikit-learn, so we can skip the feature engineering and data cleansing tasks you’d normally undertake when building a model and focus on model training and tuning. We’ll pass True to the return_X_y parameter to get the features and target back separately as X and y, and set as_frame=True so that X is returned as a Pandas dataframe and y as a Pandas series.
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
 | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
84 | 11.84 | 0.89 | 2.58 | 18.0 | 94.0 | 2.20 | 2.21 | 0.22 | 2.35 | 3.05 | 0.79 | 3.08 | 520.0 |
132 | 12.81 | 2.31 | 2.40 | 24.0 | 98.0 | 1.15 | 1.09 | 0.27 | 0.83 | 5.70 | 0.66 | 1.36 | 560.0 |
58 | 13.72 | 1.43 | 2.50 | 16.7 | 108.0 | 3.40 | 3.67 | 0.19 | 2.04 | 6.80 | 0.89 | 2.87 | 1285.0 |
143 | 13.62 | 4.95 | 2.35 | 20.0 | 92.0 | 2.00 | 0.80 | 0.47 | 1.02 | 4.40 | 0.91 | 2.05 | 550.0 |
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
If you use the value_counts() method to print the target variable values stored in y you’ll see that we have three classes. The classes are not perfectly balanced, but that’s not a problem for this experiment.
y.value_counts()
1 71
0 59
2 48
Name: target, dtype: int64
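The same counts can be checked without Pandas. This is a minimal sketch, assuming NumPy is available (it is installed alongside scikit-learn), that counts the rows per class and converts the counts to proportions.

```python
import numpy as np
from sklearn.datasets import load_wine

# Load only the target labels as a NumPy array
_, y = load_wine(return_X_y=True)

# Count how many rows fall into each class (0, 1, 2)
counts = np.bincount(y)
proportions = counts / counts.sum()

for label, (n, p) in enumerate(zip(counts, proportions)):
    print(f"class {label}: {n} rows ({p:.1%})")
```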
Now we need to split our data into training and test sets. We’ll use 70% of the data for training and 30% for testing by setting test_size to 0.3. We’ll also set a random_state value to ensure we get the same split each time we run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
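As a quick sanity check, the sketch below repeats the split on plain NumPy arrays and prints the size of each part. The second call is an optional variation that is not used in this post: passing stratify=y asks scikit-learn to keep the class proportions roughly the same in both parts, which can help when classes are unevenly sized.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split as in the post: 30% of the 178 rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(len(X_train), len(X_test))

# Optional variation (not used in the post): stratify=y keeps the class
# proportions roughly equal in the training and test sets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(sorted(set(ys_test.tolist())))
```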
Now we can use LightGBM to create a classification model via the LGBMClassifier class. We’ll use the default parameters for now, as this is just a base model and we’re going to use Optuna to look for better parameters for our dataset. Once we’ve defined the model, we’ll use fit() to train it on our training data.
base_model = lgb.LGBMClassifier()
base_model.fit(X_train, y_train)
LGBMClassifier()
Now the LightGBM classification model has been trained, we can use it to make predictions on the test data. The predictions are stored in the variable y_pred, which is a NumPy array. If you print this you’ll see the class predicted for each row in the test dataset.
y_pred = base_model.predict(X_test)
y_pred
array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 0, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1,
2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 1, 2, 0,
0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
To evaluate the performance of our classifier we’ll use two metrics from scikit-learn: accuracy and classification report. The accuracy is the number of correct predictions divided by the total number of predictions. The classification report provides a breakdown of each class by precision, recall, f1-score and support.
The scores are already very good, with an accuracy of 98.1% (53 of the 54 test rows classified correctly). However, we might be able to improve on this by tuning the model’s hyperparameters using Optuna.
accuracy_score(y_test, y_pred)
0.9814814814814815
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        23
           1       1.00      0.95      0.97        19
           2       1.00      1.00      1.00        12

    accuracy                           0.98        54
   macro avg       0.99      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54
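To make the metrics concrete, here is a small worked sketch in plain Python. The labels are made up for illustration and are not the wine test set. One class 1 row is wrongly predicted as class 0, which lowers the precision of class 0 and the recall of class 1, the same pattern seen in the report above.

```python
# Hypothetical labels, made up for illustration (not the wine test set)
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 1, 0, 1, 2, 2, 2]

# Accuracy: correct predictions divided by total predictions
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)


def precision_recall(label):
    """Precision and recall for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    predicted = sum(p == label for p in y_hat)  # rows predicted as this class
    actual = sum(t == label for t in y_true)  # rows truly in this class (support)
    return tp / predicted, tp / actual


print(f"accuracy: {accuracy:.3f}")
for label in sorted(set(y_true)):
    precision, recall = precision_recall(label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")
```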
To try to maximise the performance of our LightGBM classification model we’ll now tune the model’s hyperparameters. Hyperparameters are settings chosen before training, rather than values learned from the data, and adjusting them can yield greater accuracy and better overall results. We’ll use Optuna for our hyperparameter tuning as it’s usually much quicker than an exhaustive grid search with scikit-learn’s GridSearchCV and often finds better results.
To use Optuna you first need to create an objective function. This includes a dictionary of the model’s hyperparameters you want to test, as well as the ranges of values you want to cover during testing. Optuna will do a series of runs and test different combinations of hyperparameters by fitting them to your model and then measuring the accuracy (or whatever objective you set) before finally returning the best parameters.
def objective(trial):
    """
    Objective function to be maximized: returns the accuracy for one trial.
    """
    # Fixed settings plus the search space Optuna samples from
    param = {
        "objective": "multiclass",
        "metric": "multi_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "num_class": 3,
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    # Train a model with the sampled hyperparameters and score it
    gbm = lgb.LGBMClassifier(**param)
    gbm.fit(X_train, y_train)
    preds = gbm.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    return accuracy
To run the Optuna study and identify the best hyperparameters for our LGBMClassifier model we need to create a sampler. We’re using TPESampler, which uses the Tree-structured Parzen Estimator algorithm. We want to maximise the accuracy of our model during tuning, so we’ll pass direction="maximize" to create_study() along with our sampler. We’ll then use optimize() to run 100 trials against our objective function.
sampler = TPESampler(seed=1)
study = optuna.create_study(study_name="lightgbm", direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
To examine the results of our Optuna study we can print some values stored on the study object. Of the 100 trials we ran, trial number 14 generated the best results, with an accuracy of 1.0 or 100%. The study.best_params attribute shows us the winning hyperparameters, which we can use in our final tuned model.
print('Best parameters:', study.best_params)
Best parameters: {'lambda_l1': 9.818554108154862, 'lambda_l2': 2.4055010791348247e-06, 'num_leaves': 4, 'feature_fraction': 0.5515741134287729, 'bagging_fraction': 0.6255538253881087, 'bagging_freq': 2, 'min_child_samples': 17}
print('Best value:', study.best_value)
Best value: 1.0
print('Best trial:', study.best_trial)
Best trial: FrozenTrial(number=14, values=[1.0], datetime_start=datetime.datetime(2022, 10, 14, 7, 7, 43, 224346), datetime_complete=datetime.datetime(2022, 10, 14, 7, 7, 43, 259048), params={'lambda_l1': 9.818554108154862, 'lambda_l2': 2.4055010791348247e-06, 'num_leaves': 4, 'feature_fraction': 0.5515741134287729, 'bagging_fraction': 0.6255538253881087, 'bagging_freq': 2, 'min_child_samples': 17}, distributions={'lambda_l1': FloatDistribution(high=10.0, log=True, low=1e-08, step=None), 'lambda_l2': FloatDistribution(high=10.0, log=True, low=1e-08, step=None), 'num_leaves': IntDistribution(high=256, log=False, low=2, step=1), 'feature_fraction': FloatDistribution(high=1.0, log=False, low=0.4, step=None), 'bagging_fraction': FloatDistribution(high=1.0, log=False, low=0.4, step=None), 'bagging_freq': IntDistribution(high=7, log=False, low=1, step=1), 'min_child_samples': IntDistribution(high=100, log=False, low=5, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=14, state=TrialState.COMPLETE, value=None)
Finally, we can pass the best hyperparameters identified by Optuna back to LGBMClassifier and fit our final model. There’s no need to type out a dictionary of parameters by hand as you would normally. Instead, you can unpack the dictionary with **study.best_params and each entry is passed as a keyword argument. Note that study.best_params only contains the values Optuna sampled, so the fixed settings from the objective function, such as verbosity, are not included.
model = lgb.LGBMClassifier(**study.best_params)
model.fit(X_train, y_train)
[LightGBM] [Warning] lambda_l1 is set=9.818554108154862, reg_alpha=0.0 will be ignored. Current value: lambda_l1=9.818554108154862
[LightGBM] [Warning] bagging_fraction is set=0.6255538253881087, subsample=1.0 will be ignored. Current value: bagging_fraction=0.6255538253881087
[LightGBM] [Warning] lambda_l2 is set=2.4055010791348247e-06, reg_lambda=0.0 will be ignored. Current value: lambda_l2=2.4055010791348247e-06
[LightGBM] [Warning] feature_fraction is set=0.5515741134287729, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.5515741134287729
[LightGBM] [Warning] bagging_freq is set=2, subsample_freq=0 will be ignored. Current value: bagging_freq=2
LGBMClassifier(bagging_fraction=0.6255538253881087, bagging_freq=2,
feature_fraction=0.5515741134287729, lambda_l1=9.818554108154862,
lambda_l2=2.4055010791348247e-06, min_child_samples=17,
num_leaves=4)
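The warnings above appear because the tuned parameters use LightGBM’s native names, while LGBMClassifier has its own scikit-learn style names for the same settings. Below is a small sketch of a hypothetical helper, to_sklearn_names(), that renames the keys using the alias pairs shown in the warnings, which should avoid the conflict. The helper is not part of LightGBM or Optuna.

```python
# Alias pairs taken from the warning messages: native name -> scikit-learn style name
ALIASES = {
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
    "bagging_fraction": "subsample",
    "feature_fraction": "colsample_bytree",
    "bagging_freq": "subsample_freq",
}


def to_sklearn_names(params):
    """Return a copy of params with native LightGBM names replaced by their aliases."""
    return {ALIASES.get(name, name): value for name, value in params.items()}


# Best parameters as printed by the study (values rounded for readability)
best_params = {
    "lambda_l1": 9.8186,
    "lambda_l2": 2.4055e-06,
    "num_leaves": 4,
    "feature_fraction": 0.5516,
    "bagging_fraction": 0.6256,
    "bagging_freq": 2,
    "min_child_samples": 17,
}

renamed = to_sklearn_names(best_params)
print(renamed)
```

The renamed dictionary can be unpacked into LGBMClassifier in the same way as study.best_params.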
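Now the tuned model has been trained, we can run it on our test data again and evaluate its performance using the accuracy score and the classification report. The Optuna hyperparameter tuning did the trick and our model now classifies all 54 test rows correctly. Bear in mind that the objective function scored each trial on this same test set, so the figure is likely to be optimistic for genuinely unseen data.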
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
1.0
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        12

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54
Finally, we’ll save the model using Pickle. Using Pickle to save the model means we can load it later and use it to make predictions on new data without the need to retrain it.
filename = "lightgbm.pkl"
# The with block closes the file once the model has been written
with open(filename, "wb") as f:
    pickle.dump(model, f)
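Loading the model back is the mirror image of saving it. The sketch below shows a full round trip and makes two assumptions to stay self-contained and quick: a scikit-learn DecisionTreeClassifier stands in for the tuned LightGBM model, and the file is written to a temporary directory. Any fitted estimator is pickled and unpickled the same way.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Stand-in model (assumption): used here instead of the tuned LightGBM model
model = DecisionTreeClassifier(random_state=1).fit(X, y)

filename = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save the model; the with block closes the file automatically
with open(filename, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts exactly as before
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

same = bool((loaded_model.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same)
```

Only unpickle files from sources you trust, as loading a pickle can execute arbitrary code.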
Matt Clarke, Thursday, January 19, 2023