Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets, detecting and removing outliers is much harder, so data scientists often apply automated anomaly detection algorithms, such as the Isolation Forest, to help identify and remove outliers.

As the name suggests, the Isolation Forest is a tree-based anomaly detection algorithm. It uses an unsupervised learning approach to detect unusual data points which can then be removed from the training data. The re-training of the model on a data set with the outliers removed generally sees performance increase.

The basic idea is that you fit a base classification or regression model to your data to use as a benchmark, and then fit an outlier detection algorithm model such as an Isolation Forest to detect outliers in the training data set. The detected outliers are then removed from the training data and you re-fit the model to the new data to see if the performance improves.

Like other models, Isolation Forest models do require hyperparameter tuning to generate their best results, particularly the important “contamination” value. While you can try random settings until you find a selection that gives good results, you’ll generate the biggest performance boost by using a grid search technique with cross validation. Here’s how it’s done.

For this simplified example we’re going to fit an XGBRegressor regression model, train an Isolation Forest model to remove the outliers, and then re-fit the XGBRegressor with the new training data set. Load the packages into a Jupyter notebook and install anything you don’t have by entering `pip3 install package-name`

.

```
import time
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import IsolationForest
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")
```

You can use any data set, but I’ve used the California housing data set, because I know it includes some outliers that impact the performance of regression models.

You can load the data set into Pandas via my GitHub repository to save downloading it. The aim of the model will be to predict the `median_house_value`

from a range of other features.

```
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
```

0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|

longitude | -122.23 | -122.22 | -122.24 | -122.25 | -122.25 |

latitude | 37.88 | 37.86 | 37.85 | 37.85 | 37.85 |

housing_median_age | 41 | 21 | 52 | 52 | 52 |

total_rooms | 880 | 7099 | 1467 | 1274 | 1627 |

total_bedrooms | 129 | 1106 | 190 | 235 | 280 |

population | 322 | 2401 | 496 | 558 | 565 |

households | 126 | 1138 | 177 | 219 | 259 |

median_income | 8.3252 | 8.3014 | 7.2574 | 5.6431 | 3.8462 |

median_house_value | 452600 | 358500 | 352100 | 341300 | 342200 |

ocean_proximity | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY |

Next, I’ve done some data prep work. The `ocean_proximity`

column is a categorical variable, so I’ve lowercased the column values and used `get_dummies()`

to one-hot encoded the data. Then I’ve dropped the collinear columns `households`

, `bedrooms`

, and `population`

and used zero-imputation to fill in any missing values.

```
df['ocean_proximity'] = df['ocean_proximity'].str.lower().replace('[^0-9a-zA-Z]+','_',regex=True)
encodings = pd.get_dummies(df['ocean_proximity'], prefix='proximity')
df = pd.concat([df, encodings], axis=1)
df = df.drop(["households","total_bedrooms","population"],axis=1)
df = df.fillna(0)
```

Now the data are sorted, we’ll drop the `ocean_proximity`

column, split the data into the train and test datasets, and scale the data using `StandardScaler()`

so the various column values are on an even scale.

```
X = df.drop(['median_house_value','ocean_proximity'], axis=1)
y = df['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

```
X_train.shape
```

```
(14448, 10)
```

Once the data are split and scaled, we’ll fit a default and un-tuned `XGBRegressor()`

model to the training data and
use cross validation to determine the mean squared error for the 10 folds and the Root Mean Squared error from the test data set. We’ll use this as our baseline result to which we can compare the tuned results. This gives us an RMSE of 49,495 on the test data and a score of 48,810 on the cross validation data.

```
regressors = {
"XGBRegressor": XGBRegressor(random_state=1)
}
df_models = pd.DataFrame(columns=['model', 'run_time', 'rmse', 'rmse_cv'])
for key in regressors:
start_time = time.time()
regressor = regressors[key]
model = regressor.fit(X_train, y_train)
y_pred = model.predict(X_test)
scores = cross_val_score(model,
X_train,
y_train,
scoring="neg_mean_squared_error",
cv=10)
row = {'model': key,
'run_time': format(round((time.time() - start_time)/60,2)),
'rmse': round(np.sqrt(mean_squared_error(y_test, y_pred))),
'rmse_cv': round(np.mean(np.sqrt(-scores)))
}
df_models = df_models.append(row, ignore_index=True)
df_models.head()
```

model | run_time | rmse | rmse_cv | |
---|---|---|---|---|

0 | XGBRegressor | 0.05 | 49495 | 48810 |

We’ll now use `GridSearchCV`

to test a range of different hyperparameters to find the optimum settings for the IsolationForest model. On each iteration of the grid search, the model will be refitted to the training data with a new set of parameters, and the mean squared error will be recorded. Once all of the permutations have been tested, the optimum set of model parameters will be returned.

```
model = IsolationForest(random_state=47)
param_grid = {'n_estimators': [1000, 1500],
'max_samples': [10],
'contamination': ['auto', 0.0001, 0.0002],
'max_features': [10, 15],
'bootstrap': [True],
'n_jobs': [-1]}
grid_search = model_selection.GridSearchCV(model,
param_grid,
scoring="neg_mean_squared_error",
refit=True,
cv=10,
return_train_score=True)
grid_search.fit(X_train, y_train)
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
```

```
Optimum parameters {'bootstrap': True, 'contamination': 0.0001, 'max_features': 10, 'max_samples': 10, 'n_estimators': 1000, 'n_jobs': -1}
```

Now we will fit an IsolationForest model to the training data (not the test data) using the optimum settings we identified using the grid search above. You may need to try a range of settings in the step above to find what works best, or you can just enter a load and leave your grid search to run overnight. Running the Isolation Forest model will return a Numpy array of predictions containing the outliers we need to remove.

```
iforest = IsolationForest(bootstrap=True,
contamination=0.0001,
max_features=10,
max_samples=10,
n_estimators=1000,
n_jobs=-1,
random_state=1)
y_pred = iforest.fit_predict(X_train)
```

We can now use the `y_pred`

array to remove the offending values from the `X_train`

and `y_train`

data and return the new `X_train_iforest`

and `y_train_iforest`

. If you print the `shape`

of the new `X_train_iforest`

you’ll see that it now contains 14,446 values, compared to the 14,448 in the original dataset. The optimum Isolation Forest settings therefore removed just two of the outliers. Removing more caused the cross fold validation score to drop.

```
X_train_iforest, y_train_iforest = X_train[(y_pred != -1), :], y_train[(y_pred != -1)]
```

```
print(X_train_iforest.shape, y_train_iforest.shape)
```

```
(14446, 10) (14446,)
```

Finally, we can use the new inlier training data, with outliers removed, to re-fit the original `XGBRegressor`

model on the new data and then compare the score with the one we obtained in the test fit earlier.

```
regressor = XGBRegressor(random_state=1)
model = regressor.fit(X_train_iforest, y_train_iforest)
y_pred = model.predict(X_test)
scores = cross_val_score(model,
X_train_iforest,
y_train_iforest,
scoring="neg_mean_squared_error",
cv=10)
row = {'model': 'XGBRegressor Isolation Forest',
'run_time': format(round((time.time() - start_time)/60,2)),
'rmse': round(np.sqrt(mean_squared_error(y_test, y_pred))),
'rmse_cv': round(np.mean(np.sqrt(-scores)))
}
df_models = df_models.append(row, ignore_index=True)
```

Comparing the performance of the base `XGBRegressor`

on the full data set shows that we improved the RMSE from the original score of 49,495 on the test data, down to 48,677 on the test data after the two outliers were removed. Although this is only a modest improvement, every little helps and when combined with other methods, such as the tuning of the XGBoost model, this should add up to a nice performance increase.

```
df_models.head()
```

model | run_time | rmse | rmse_cv | |
---|---|---|---|---|

0 | XGBRegressor | 0.05 | 49495 | 48810 |

1 | XGBRegressor Isolation Forest | 7.85 | 48677 | 48833 |

Matt Clarke, Sunday, March 14, 2021