The LightGBM model is a gradient boosting framework that uses tree-based learning algorithms, much like the popular XGBoost model. LightGBM supports both classification and regression tasks, and is known for...
Although all business know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be...
When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data and can often...
The CatBoost model is a gradient boosting model that is based on decision trees, much like XGBoost, LightGBM, and other tree-based models. It is a very popular model for tabular...
The XGBRegressor regression model in XGBoost is one of the most effective regression models used in machine learning. As with the other XGBoost models, XGBRegressor is a gradient boosting model...
AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and...
Over the past year or so, the Optuna package has quickly become a favourite among data scientists for hyperparameter tuning on machine learning models, and for good reason. It’s lightweight,...
Naive Bayes classifiers are commonly used for machine learning text classification problems, such as predicting the sentiment of a tweet, identifying the language of a piece of text, or categorising...
When training a machine learning model you will split your dataset in two, with one portion of the data used to train the model, and the other portion (usually 20-30%)...
The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble...
The Decision Tree or DT is one of the most well known and most widely used supervised machine learning algorithms and can be applied to both regression and classification. As...
CountVectorizer is a scikit-learn package that uses count vectorization to convert a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as...
After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t...
A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance,...
In ecommerce, customer service staff are often among the busiest people in the organisation, handling hundreds of tasks every day, often simultaneously. However, CS managers often get so bogged down...
There’s often a great deal of repetition in machine learning projects. A typical machine learning workflow involves a number of common processes designed to clean, prepare, and transform data, so...
One interesting technique in feature engineering is the use of Decision Trees (and other models) to create or derive new features using combinations of features from the original dataset. Here,...
If you want to change careers and move into the data science or data mining field, as either a data scientist or a data engineer, or simply improve your skills,...
Ecommerce purchase intention models analyse click-stream consumer behaviour data from web analytics platforms to predict whether a customer will make a purchase during their visit. These online shopping models are...
The XGBoost or Extreme Gradient Boosting algorithm is a decision tree based machine learning algorithm which uses a process called boosting to help improve performance. Since it’s introduction, it’s become...
In the field of HR analytics, data scientists are now using employee data from their human resources department to predict employee churn. The techniques for predicting employee churn are fairly...
Marketing Mix Models (MMMs) utilise multivariate linear regression to predict sales from marketing costs, and various other parameters. A Marketing Mix Model (also called a Media Mix Model), even at...
Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets,...
K means is one of the most widely used algorithms for clustering data and falls into the unsupervised learning group of machine learning models. It’s ideal for many forms of...
RFM segmentation is one of the oldest and most effective ways to segment customers. RFM models are based on three simple values - recency, frequency, and monetary value - which...
Ensemble models combine the predicitions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several...
Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that...
When using the k means clustering algorithm, you need to specifically define k, or the number of clusters you want the algorithm to create. Rather than selecting an arbitrary value,...
I love sarcasm, but unfortunately I have a shaky ability to easily detect it in the voices of others, an aptitude for misinterpreting serious comments for sarcasm and then inappropriately...
Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s...
There are hundreds of excellent Python data science libraries and packages that you’ll encounter when working on data science projects. However, there are four of them that you’ll probably use...
Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...
Something which often confuses non data scientists is that too many features can be a bad thing for a model. It does sound logical that including more features and data...
There are many techniques you can apply to improve the performance of your machine learning models, but two of the most powerful are model selection and hyperparameter tuning. As models...
There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...
Machine learning models often take hours or days to run, especially on large datasets with many features. If your machine goes off, you’ll lose your model and you’ll need to...
The predictive response models used to help identify customers in marketing can also be used to help outbound sales teams improve their call conversion rate by targeting the best people...
Linear regression models are widely used in every industry. They predict a number from a range of other features based on a linear relationship between the input variables (X) and...
While there are many open source datasets available for you to use when learning new data science techniques, sometimes you may struggle to find a data set to use to...
ABC inventory classification has been one of the most widely used methods of stock control in operations management for decades. It’s an intentionally simple system in which products are assigned...
Data binning, bucketing, or discrete binning, is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous...
Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re based on object or datetime data types such as text...
Machine learning (ML) is a branch of artificial intelligence (AI) in which models are created to predict an outcome by learning from patterns present in data. They can automatically improve...
Although scikit-learn’s machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little of tweaking. Each estimator class accepts arguments called...
If you’re not fortunate enough to have a really powerful data science workstation for your work, one of the problems you’ll likely face is that your models can take quite...
The scikit-learn package comes with a range of small built-in toy datasets that are ideal for using in test projects and applications. As they’re part of the scikit-learn package, you...
As a practical demonstration of how the confusion matrix works, lets load up the Wisconsin Breast Cancer dataset, create a classification model and examine the confusion matrix to see how...
As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data...