The LightGBM model is a gradient boosting framework that uses tree-based learning algorithms, much like the popular XGBoost model. LightGBM supports both classification and regression tasks, and is known for...
Although all businesses know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which customers will churn or be...
When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data and can often...
The CatBoost model is a gradient boosting model that is based on decision trees, much like XGBoost, LightGBM, and other tree-based models. It is a very popular model for tabular...
The XGBRegressor regression model in XGBoost is one of the most effective regression models used in machine learning. As with the other XGBoost models, XGBRegressor is a gradient boosting model...
AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and...
OpenAI Whisper is a new open source automatic speech recognition (ASR) model from OpenAI, the organisation that also brought us the incredible GPT-3 language models. Like GPT-3, it’s...
Over the past year or so, the Optuna package has quickly become a favourite among data scientists for hyperparameter tuning on machine learning models, and for good reason. It’s lightweight,...
Fake reviews seem to be everywhere these days, leaving customers unsure over which products or businesses are actually any good. Whether you’re shopping on Amazon, checking out a restaurant on...
Tokenization is a data science technique that breaks up a sentence into a comma-separated list of distinct words or values. It’s a crucial first step in...
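As a rough illustration of the idea, here is a minimal tokenizer sketch that uses only the Python standard library. The `tokenize` helper and its regular expression are assumptions made for this example, not the method of any particular NLP library, and real tokenizers handle punctuation, Unicode, and contractions in more sophisticated ways.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase a sentence and split it into a list of word tokens."""
    # \w+ matches runs of letters, digits, and underscores. The optional
    # group keeps simple contractions such as "it's" together as one token.
    return re.findall(r"\w+(?:'\w+)?", sentence.lower())


tokens = tokenize("It's a crucial first step.")
print(tokens)
```

Punctuation is discarded here, which suits a bag-of-words model but would not suit a task that needs sentence boundaries.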
Naive Bayes classifiers are commonly used for machine learning text classification problems, such as predicting the sentiment of a tweet, identifying the language of a piece of text, or categorising...
When training a machine learning model you will split your dataset in two, with one portion of the data used to train the model, and the other portion (usually 20-30%)...
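The split described above can be sketched in a few lines of plain Python. The `split_rows` helper, the 25% test size, and the fixed seed are all assumptions chosen for illustration. In practice you would more likely reach for a library function, and you may need stratification or a time-based split depending on the data.

```python
import random


def split_rows(rows, test_size=0.25, seed=42):
    """Shuffle the rows and hold back a fraction of them for testing."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(100), test_size=0.25)
print(len(train), len(test))
```

Shuffling before splitting matters when the rows are ordered, for example by date or by class, because an unshuffled split would then give the model an unrepresentative test set.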
The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble...
The Decision Tree or DT is one of the most well known and most widely used supervised machine learning algorithms and can be applied to both regression and classification. As...
In ecommerce, it is often difficult to tell whether your search traffic is performing to expectations. What your boss perceives to be caused by an on-site or marketing-related issue may...
CountVectorizer is a scikit-learn class that uses count vectorization to convert a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as...
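To show what a matrix of token counts looks like, here is a standard-library sketch of the idea behind count vectorization. It is not scikit-learn's implementation: the `count_vectorize` helper is hypothetical, it splits on whitespace only, and it returns plain lists where scikit-learn returns a sparse matrix.

```python
from collections import Counter


def count_vectorize(docs):
    """Return a sorted vocabulary and one row of token counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct token in the corpus, sorted so that
    # each term always maps to the same column.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that do not appear in the document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocab, matrix = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocab)
print(matrix)
```

Each row is a document and each column is a vocabulary term, so the second row records that "the" appears twice in the second document.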
Time series forecasting uses machine learning to predict future values of time series data. In this project we’ll be using the Neural Prophet model to predict future values of Google...
A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance,...
One issue with the more sophisticated algorithms, such as Extreme Gradient Boosting, is that they can overfit to the data. This basically means that the model picks up the idiosyncrasies...
In ecommerce, customer service staff are often among the busiest people in the organisation, handling hundreds of tasks every day, often simultaneously. However, CS managers often get so bogged down...
There’s often a great deal of repetition in machine learning projects. A typical machine learning workflow involves a number of common processes designed to clean, prepare, and transform data, so...
One common conundrum in e-commerce and marketing involves trying to ascertain whether a given change in marketing activity, product price, or site design or content, has had a statistically significant...
One interesting technique in feature engineering is the use of Decision Trees (and other models) to create or derive new features using combinations of features from the original dataset. Here,...
Ecommerce purchase intention models analyse click-stream consumer behaviour data from web analytics platforms to predict whether a customer will make a purchase during their visit. These online shopping models are...
The XGBoost or Extreme Gradient Boosting algorithm is a decision tree based machine learning algorithm which uses a process called boosting to help improve performance. Since its introduction, it’s become...
In the field of HR analytics, data scientists are now using employee data from their human resources department to predict employee churn. The techniques for predicting employee churn are fairly...
Meta descriptions are strings of text added to the head of an HTML document to describe its content to search engines and search engine users and are of critical importance...
Marketing Mix Models (MMMs) utilise multivariate linear regression to predict sales from marketing costs, and various other parameters. A Marketing Mix Model (also called a Media Mix Model), even at...
The Neural Prophet model is relatively new and was heavily inspired by Facebook’s earlier Prophet time series forecasting model. NeuralProphet is a neural network based model that uses a PyTorch...
Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets,...
K means is one of the most widely used algorithms for clustering data and falls into the unsupervised learning group of machine learning models. It’s ideal for many forms of...
Understanding what drives customer churn is critical to business success. While there is always going to be natural churn that you can’t prevent, the most common reasons for churn are...
In the ecommerce sector, one of the most common tasks you’ll undertake after arriving at work each morning is to check over the recent analytics data for your site and...
Knowing which of your customers are going to churn before it happens is a powerful tool in the battle against attrition, since you can take action and try to prevent...
Zero-shot learning, or ZSL, is a machine learning process commonly used for Natural Language Processing that allows you to generate predictions on unseen data without the need to train a...
Several years ago, in one of my first Ecommerce Director roles, I worked with the ex-Myprotein founder to launch sports nutrition brand GoNutrition. As a “bootstrapped” startup, we were low...
In ecommerce, writing good product copy is both an art and a science. Not only does product copy need to be written in the correct tone and style for your...
Ensemble models combine the predictions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several...
Time series data have a reputation for being somewhat complicated, partly because they’re made up of a number of different components that work together. At the most basic level these...
Ecommerce copywriters are busy people and don’t have the privilege of having eagle-eyed sub editors to sub-edit their copy and check it for spelling mistakes or grammatical issues, as magazine...
Product matching or data matching is a computational technique employing Natural Language Processing and machine learning which aims to identify identical products being sold on different websites, where product names...
Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that...
When using the k means clustering algorithm, you need to specifically define k, or the number of clusters you want the algorithm to create. Rather than selecting an arbitrary value,...
There’s often a lot of repetition in many data science projects. In tasks that utilise Natural Language Processing (or NLP), for example, you’ll always need to preprocess your text to...
I love sarcasm, but unfortunately I have a shaky ability to easily detect it in the voices of others, an aptitude for misinterpreting serious comments as sarcasm and then inappropriately...
Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s...
Search intent classification has been around for almost 20 years, but has only recently started to move into the mainstream in ecommerce and technical SEO. Here’s a quick guide to...
Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where...
Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...
Something which often confuses non-data scientists is that too many features can be a bad thing for a model. It does sound logical that including more features and data...
There are many techniques you can apply to improve the performance of your machine learning models, but two of the most powerful are model selection and hyperparameter tuning. As models...
There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...
Machine learning models often take hours or days to run, especially on large datasets with many features. If your machine goes off, you’ll lose your model and you’ll need to...
Time series forecasting models are notoriously tricky to master, especially in ecommerce, where you have seasonality, the weather, marketing promotions, and holidays to consider. Not to mention pandemics.
The predictive response models used to help identify customers in marketing can also be used to help outbound sales teams improve their call conversion rate by targeting the best people...
Linear regression models are widely used in every industry. They predict a number from a range of other features based on a linear relationship between the input variables (X) and...
When you’re building a Natural Language Processing model, it’s the text annotation process which is the most laborious and the most expensive for your business. While you can use tools...
While there are many open source datasets available for you to use when learning new data science techniques, sometimes you may struggle to find a data set to use to...
While many models are now pre-trained to identify certain objects, in most cases you will need to undertake further training. This requires the construction of image classification datasets containing a...
Data binning, bucketing, or discrete binning, is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous...
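A minimal sketch of binning a continuous value into labelled buckets, using only the standard library. The `assign_bin` helper, the bin edges, the labels, and the spend figures are all made up for illustration. Sensible edges for a real customer segmentation would come from the data itself, for example from quantiles.

```python
from bisect import bisect_right


def assign_bin(value, edges, labels):
    """Map a continuous value to a label using ascending bin edges.

    There must be exactly one more label than there are edges.
    """
    # bisect_right counts how many edges are <= value, which is the index
    # of the bin. A value equal to an edge falls into the higher bin.
    return labels[bisect_right(edges, value)]


edges = [50, 200]                    # hypothetical spend thresholds
labels = ["low", "medium", "high"]
spend = [12.5, 50, 180, 950]         # toy customer spend values
segments = [assign_bin(value, edges, labels) for value in spend]
print(segments)
```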
Whether you’re performing product attribute extraction, named entity recognition, product matching, product categorisation, review sentiment analysis, or you are sorting and prioritising customer support tickets, NLP models can be extremely...
If you read research papers on machine learning, you’ll notice that many researchers use the same standard datasets so other data scientists can reproduce their work or try and improve...
You might think human behaviour would be hard to predict but, in ecommerce data science, it’s not actually as difficult as you may think to predict whether a customer will...
While some people might naively interpret it as negativity, I think one of the best ways you can improve an ecommerce business is to focus on the stuff you’re not...
Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re based on object or datetime data types such as text...
Product attributes, such as size, weight, wattage, or colour, are critical in ecommerce as they help customers find and select the right product for their needs. However, obtaining, adding, and...
A Next-Product-To-Buy (or NPTB) model is designed to help retailers and marketers improve the effectiveness of cross-selling product recommendations by predicting the product each customer would be most likely to...
Machine learning (ML) is a branch of artificial intelligence (AI) in which models are created to predict an outcome by learning from patterns present in data. They can automatically improve...
Uplift modeling is a machine learning technique used in marketing and ecommerce to predict which customers are likely to respond to a particular marketing campaign. However, rather than simply predicting...
On-site search in ecommerce has improved massively in recent years, thanks to search systems such as Lucene, Solr, Algolia, and Elastic. Despite on-site search generating massive amounts of revenue for...
Hugging Face Transformers are a collection of State-of-the-Art (SOTA) natural language processing models produced by the Hugging Face group. Basically, Hugging Face take the latest models covered in current natural...
Although scikit-learn’s machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little tweaking. Each estimator class accepts arguments called...
Despite having been a Linux user for about 20 years, there are times when I find I have wasted days trying to solve a seemingly simple problem. One such issue...
Facial recognition algorithms have made giant steps in the past decade and have become commonplace in everything from social networks and mobile phone camera software, to surveillance systems. They make...
Have you ever wanted to remove the singing from a track, so you can create an instrumental version to sing Karaoke to? Or do you want to remix a track...
Convolutional Neural Networks or CNNs are one of the most widely used AI techniques for detecting complex features in data. They’re particularly good for image recognition, and are used in...
Sentiment analysis, or opinion mining, is a form of emotion AI and uses natural language processing and computational linguistics to analyse text and infer the sentiment. Sentiment analysis has loads...
If you’re not fortunate enough to have a really powerful data science workstation for your work, one of the problems you’ll likely face is that your models can take quite...
When you’re building a machine learning model, the feature engineering step is often the most important. From your initial small batch of features, the clever use of maths and stats...
As a practical demonstration of how the confusion matrix works, let’s load up the Wisconsin Breast Cancer dataset, create a classification model and examine the confusion matrix to see how...
As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data...
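One simple way to deal with such values is to drop any row that contains them. The sketch below uses only the standard library, and the `drop_non_finite` helper and the toy rows are assumptions for illustration. Dropping rows is only one option, and imputing a replacement value is often preferable when data is scarce.

```python
import math


def drop_non_finite(rows):
    """Keep only the rows whose values are all present and finite."""
    return [
        row
        for row in rows
        # math.isfinite is False for NaN, inf, and -inf. None is checked
        # first because math.isfinite raises TypeError on it.
        if all(value is not None and math.isfinite(value) for value in row)
    ]


rows = [
    [1.0, 2.0],
    [float("nan"), 3.0],   # NaN
    [4.0, float("inf")],   # infinity
    [None, 7.0],           # null
    [5.0, 6.0],
]
clean = drop_non_finite(rows)
print(clean)
```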