How to use the Isolation Forest model for outlier detection

Learn how to use the Isolation Forest or iForest algorithm in sklearn for automated outlier detection in machine learning.

How to use the Isolation Forest model for outlier detection
Picture by Niilo Isotalo, Unsplash.
14 minutes to read

Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets, detecting and removing outliers is much harder, so data scientists often apply automated anomaly detection algorithms, such as the Isolation Forest, to help identify and remove outliers.

As the name suggests, the Isolation Forest is a tree-based anomaly detection algorithm. It uses an unsupervised learning approach to detect unusual data points which can then be removed from the training data. The re-training of the model on a data set with the outliers removed generally sees performance increase.

How is Isolation Forest used?

The basic idea is that you fit a base classification or regression model to your data to use as a benchmark, and then fit an outlier detection algorithm model such as an Isolation Forest to detect outliers in the training data set. The detected outliers are then removed from the training data and you re-fit the model to the new data to see if the performance improves.

Like other models, Isolation Forest models do require hyperparameter tuning to generate their best results, particularly the important “contamination” value. While you can try random settings until you find a selection that gives good results, you’ll generate the biggest performance boost by using a grid search technique with cross validation. Here’s how it’s done.

Load the packages

For this simplified example we’re going to fit an XGBRegressor regression model, train an Isolation Forest model to remove the outliers, and then re-fit the XGBRegressor with the new training data set. Load the packages into a Jupyter notebook and install anything you don’t have by entering pip3 install package-name.

import time
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import IsolationForest
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")

Load the data

You can use any data set, but I’ve used the California housing data set, because I know it includes some outliers that impact the performance of regression models.

You can load the data set into Pandas via my GitHub repository to save downloading it. The aim of the model will be to predict the median_house_value from a range of other features.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
0 1 2 3 4
longitude -122.23 -122.22 -122.24 -122.25 -122.25
latitude 37.88 37.86 37.85 37.85 37.85
housing_median_age 41 21 52 52 52
total_rooms 880 7099 1467 1274 1627
total_bedrooms 129 1106 190 235 280
population 322 2401 496 558 565
households 126 1138 177 219 259
median_income 8.3252 8.3014 7.2574 5.6431 3.8462
median_house_value 452600 358500 352100 341300 342200
ocean_proximity NEAR BAY NEAR BAY NEAR BAY NEAR BAY NEAR BAY

Prepare the data

Next, I’ve done some data prep work. The ocean_proximity column is a categorical variable, so I’ve lowercased the column values and used get_dummies() to one-hot encoded the data. Then I’ve dropped the collinear columns households, bedrooms, and population and used zero-imputation to fill in any missing values.

df['ocean_proximity'] = df['ocean_proximity'].str.lower().replace('[^0-9a-zA-Z]+','_',regex=True)
encodings = pd.get_dummies(df['ocean_proximity'], prefix='proximity')
df = pd.concat([df, encodings], axis=1)

df = df.drop(["households","total_bedrooms","population"],axis=1)

df = df.fillna(0)

Create the base model

Now the data are sorted, we’ll drop the ocean_proximity column, split the data into the train and test datasets, and scale the data using StandardScaler() so the various column values are on an even scale.

X = df.drop(['median_house_value','ocean_proximity'], axis=1)
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape
(14448, 10)

Once the data are split and scaled, we’ll fit a default and un-tuned XGBRegressor() model to the training data and use cross validation to determine the mean squared error for the 10 folds and the Root Mean Squared error from the test data set. We’ll use this as our baseline result to which we can compare the tuned results. This gives us an RMSE of 49,495 on the test data and a score of 48,810 on the cross validation data.

regressors = {
    "XGBRegressor": XGBRegressor(random_state=1)
}

df_models = pd.DataFrame(columns=['model', 'run_time', 'rmse', 'rmse_cv'])

for key in regressors:

    start_time = time.time()

    regressor = regressors[key]
    model = regressor.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    scores = cross_val_score(model, 
                             X_train, 
                             y_train,
                             scoring="neg_mean_squared_error", 
                             cv=10)

    row = {'model': key,
           'run_time': format(round((time.time() - start_time)/60,2)),
           'rmse': round(np.sqrt(mean_squared_error(y_test, y_pred))),
           'rmse_cv': round(np.mean(np.sqrt(-scores)))
    }

    df_models = df_models.append(row, ignore_index=True)

df_models.head()
model run_time rmse rmse_cv
0 XGBRegressor 0.05 49495 48810

Tune the Isolation Forest model

We’ll now use GridSearchCV to test a range of different hyperparameters to find the optimum settings for the IsolationForest model. On each iteration of the grid search, the model will be refitted to the training data with a new set of parameters, and the mean squared error will be recorded. Once all of the permutations have been tested, the optimum set of model parameters will be returned.

model = IsolationForest(random_state=47)

param_grid = {'n_estimators': [1000, 1500], 
              'max_samples': [10], 
              'contamination': ['auto', 0.0001, 0.0002], 
              'max_features': [10, 15], 
              'bootstrap': [True], 
              'n_jobs': [-1]}

grid_search = model_selection.GridSearchCV(model, 
                                           param_grid,
                                           scoring="neg_mean_squared_error", 
                                           refit=True,
                                           cv=10, 
                                           return_train_score=True)
grid_search.fit(X_train, y_train)

best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
Optimum parameters {'bootstrap': True, 'contamination': 0.0001, 'max_features': 10, 'max_samples': 10, 'n_estimators': 1000, 'n_jobs': -1}

Run the tuned Isolation Forest model

Now we will fit an IsolationForest model to the training data (not the test data) using the optimum settings we identified using the grid search above. You may need to try a range of settings in the step above to find what works best, or you can just enter a load and leave your grid search to run overnight. Running the Isolation Forest model will return a Numpy array of predictions containing the outliers we need to remove.

iforest = IsolationForest(bootstrap=True,
                          contamination=0.0001, 
                          max_features=10, 
                          max_samples=10, 
                          n_estimators=1000, 
                          n_jobs=-1,
                         random_state=1)
y_pred = iforest.fit_predict(X_train)

Remove the outliers

We can now use the y_pred array to remove the offending values from the X_train and y_train data and return the new X_train_iforest and y_train_iforest. If you print the shape of the new X_train_iforest you’ll see that it now contains 14,446 values, compared to the 14,448 in the original dataset. The optimum Isolation Forest settings therefore removed just two of the outliers. Removing more caused the cross fold validation score to drop.

X_train_iforest, y_train_iforest = X_train[(y_pred != -1), :], y_train[(y_pred != -1)]
print(X_train_iforest.shape, y_train_iforest.shape)
(14446, 10) (14446,)

Refit the regression model

Finally, we can use the new inlier training data, with outliers removed, to re-fit the original XGBRegressor model on the new data and then compare the score with the one we obtained in the test fit earlier.

regressor = XGBRegressor(random_state=1)
model = regressor.fit(X_train_iforest, y_train_iforest)
y_pred = model.predict(X_test)

scores = cross_val_score(model, 
                         X_train_iforest, 
                         y_train_iforest,
                         scoring="neg_mean_squared_error", 
                         cv=10)

row = {'model': 'XGBRegressor Isolation Forest',
       'run_time': format(round((time.time() - start_time)/60,2)),
       'rmse': round(np.sqrt(mean_squared_error(y_test, y_pred))),
       'rmse_cv': round(np.mean(np.sqrt(-scores)))
}

df_models = df_models.append(row, ignore_index=True)

Compare the model’s performance

Comparing the performance of the base XGBRegressor on the full data set shows that we improved the RMSE from the original score of 49,495 on the test data, down to 48,677 on the test data after the two outliers were removed. Although this is only a modest improvement, every little helps and when combined with other methods, such as the tuning of the XGBoost model, this should add up to a nice performance increase.

df_models.head()
model run_time rmse rmse_cv
0 XGBRegressor 0.05 49495 48810
1 XGBRegressor Isolation Forest 7.85 48677 48833

Matt Clarke, Sunday, March 14, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.

Unsupervised Learning in Python

Learn how to cluster, transform, visualize, and extract insights from unlabeled datasets using scikit-learn and scipy.

Start course for FREE

Comments