Although all business know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be...
Over the past year or so, the Optuna package has quickly become a favourite among data scientists for hyperparameter tuning on machine learning models, and for good reason. It’s lightweight,...
One issue with the more sophisticated algorithms, such as Extreme Gradient Boosting, is that they can overfit to the data. This basically means that the model picks up the idiosyncrasies...
One interesting technique in feature engineering is the use of Decision Trees (and other models) to create or derive new features using combinations of features from the original dataset. Here,...
The XGBoost or Extreme Gradient Boosting algorithm is a decision tree based machine learning algorithm which uses a process called boosting to help improve performance. Since it’s introduction, it’s become...
Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets,...
Ensemble models combine the predicitions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several...
Product matching or data matching is a computational technique employing Natural Language Processing and machine learning which aims to identify identical products being sold on different websites, where product names...
Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s...
Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...
There are many techniques you can apply to improve the performance of your machine learning models, but two of the most powerful are model selection and hyperparameter tuning. As models...
There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...
Machine learning models often take hours or days to run, especially on large datasets with many features. If your machine goes off, you’ll lose your model and you’ll need to...
Although scikit-learn’s machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little of tweaking. Each estimator class accepts arguments called...
If you’re not fortunate enough to have a really powerful data science workstation for your work, one of the problems you’ll likely face is that your models can take quite...
When you’re building a machine learning model, the feature engineering step is often the most important. From your initial small batch of features, the clever use of maths and stats...
As a practical demonstration of how the confusion matrix works, lets load up the Wisconsin Breast Cancer dataset, create a classification model and examine the confusion matrix to see how...