Machine Learning

81 articles and tutorials on machine learning using Python

How to tune a LightGBMClassifier model with Optuna

The LightGBM model is a gradient boosting framework that uses tree-based learning algorithms, much like the popular XGBoost model. LightGBM supports both classification and regression tasks, and is known for...

How to create a customer retention model with XGBoost

Although all businesses know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be...

How to add feature engineering to a scikit-learn pipeline

When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data and can often...

How to tune a CatBoostClassifier model with Optuna

The CatBoost model is a gradient boosting model that is based on decision trees, much like XGBoost, LightGBM, and other tree-based models. It is a very popular model for tabular...

How to tune an XGBRegressor model with Optuna

The XGBRegressor regression model in XGBoost is one of the most effective regression models used in machine learning. As with the other XGBoost models, XGBRegressor is a gradient boosting model...

How to create and tune an AdaBoost classification model

AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and...

How to transcribe YouTube videos with OpenAI Whisper

OpenAI Whisper is a new open source automatic speech recognition (ASR) model from OpenAI, the organisation that has also brought us the incredible GPT-3 language models. Like GPT-3, it’s...

How to use Optuna for XGBoost hyperparameter tuning

Over the past year or so, the Optuna package has quickly become a favourite among data scientists for hyperparameter tuning on machine learning models, and for good reason. It’s lightweight,...

How to create a fake review detection model

Fake reviews seem to be everywhere these days, leaving customers unsure over which products or businesses are actually any good. Whether you’re shopping on Amazon, checking out a restaurant on...

How to perform tokenization in NLP with NLTK and Python

Tokenization is a data science technique that splits a sentence into a list of distinct words or values, known as tokens. It’s a crucial first step in...
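
As a rough illustration of the idea, here is a minimal sketch that uses Python's built-in re module rather than NLTK. The regular expression and the function name simple_tokenize are assumptions made for this example, and NLTK's own tokenizers handle many more edge cases.

```python
import re


def simple_tokenize(sentence):
    """Split a sentence into word and punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_tokenize("Tokenization is a crucial first step.")
print(tokens)
```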

How to create a Naive Bayes text classification model using scikit-learn

Naive Bayes classifiers are commonly used for machine learning text classification problems, such as predicting the sentiment of a tweet, identifying the language of a piece of text, or categorising...

How to use cross validation in scikit-learn machine learning models

When training a machine learning model you will split your dataset in two, with one portion of the data used to train the model, and the other portion (usually 20-30%)...
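
Cross validation repeats that train/test split several times so that each portion of the data is used for testing once. The sketch below shows the index bookkeeping in plain Python. The function name k_fold_indices is hypothetical, and in practice you would probably reach for scikit-learn's KFold instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder across the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Five folds over ten rows: each fold holds out 20% of the data for testing
folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```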

How to create a random forest classification model using scikit-learn

The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble...

How to create a decision tree classification model using scikit-learn

The Decision Tree or DT is one of the most well known and most widely used supervised machine learning algorithms and can be applied to both regression and classification. As...

How to forecast Google Trends search data with NeuralProphet

In ecommerce, it is often difficult to tell whether your search traffic is performing to expectations. What your boss perceives to be caused by an on-site or marketing-related issue may...

How to use CountVectorizer for n-gram analysis

CountVectorizer is a scikit-learn class that uses count vectorization to convert a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as...

How to create Google Search Console time series forecasts using Neural Prophet

Time series forecasting uses machine learning to predict future values of time series data. In this project we’ll be using the Neural Prophet model to predict future values of Google...

How to create a contractual churn model in scikit-learn

A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance,...

How to avoid model overfitting with early stopping rounds

One issue with the more sophisticated algorithms, such as Extreme Gradient Boosting, is that they can overfit to the data. This basically means that the model picks up the idiosyncrasies...

How to classify customer support tickets using Naive Bayes

In ecommerce, customer service staff are often among the busiest people in the organisation, handling hundreds of tasks every day, often simultaneously. However, CS managers often get so bogged down...

How to use pipelines in your machine learning models

There’s often a great deal of repetition in machine learning projects. A typical machine learning workflow involves a number of common processes designed to clean, prepare, and transform data, so...

How to infer the effects of marketing using the Causal Impact model

One common conundrum in e-commerce and marketing involves trying to ascertain whether a given change in marketing activity, product price, or site design or content, has had a statistically significant...

How to engineer new features using Decision Tree models

One interesting technique in feature engineering is the use of Decision Trees (and other models) to create or derive new features using combinations of features from the original dataset. Here,...

How to create an ecommerce purchase intention model in Python

Ecommerce purchase intention models analyse click-stream consumer behaviour data from web analytics platforms to predict whether a customer will make a purchase during their visit. These online shopping models are...

How to create a classification model using XGBoost in Python

The XGBoost or Extreme Gradient Boosting algorithm is a decision tree based machine learning algorithm which uses a process called boosting to help improve performance. Since its introduction, it’s become...

How to predict employee churn using CatBoost

In the field of HR analytics, data scientists are now using employee data from their human resources department to predict employee churn. The techniques for predicting employee churn are fairly...

How to auto-generate meta descriptions with EcommerceTools

Meta descriptions are strings of text added to the head of an HTML document to describe its content to search engines and search engine users and are of critical importance...

How to create a basic Marketing Mix Model in scikit-learn

Marketing Mix Models (MMMs) utilise multivariate linear regression to predict sales from marketing costs, and various other parameters. A Marketing Mix Model (also called a Media Mix Model), even at...

How to make time series forecasts with Neural Prophet

The Neural Prophet model is relatively new and was heavily inspired by Facebook’s earlier Prophet time series forecasting model. NeuralProphet is a neural network based model that uses a PyTorch...

How to use the Isolation Forest model for outlier detection

Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets,...

How to use k means clustering for customer segmentation

K means is one of the most widely used algorithms for clustering data and falls into the unsupervised learning group of machine learning models. It’s ideal for many forms of...

How to identify the causes of customer churn

Understanding what drives customer churn is critical to business success. While there is always going to be natural churn that you can’t prevent, the most common reasons for churn are...

How to create ecommerce anomaly detection models

In the ecommerce sector, one of the most common tasks you’ll undertake after arriving at work each morning is to check over the recent analytics data for your site and...

How to create a non-contractual churn model for ecommerce

Knowing which of your customers are going to churn before it happens is a powerful tool in the battle against attrition, since you can take action and try to prevent...

How to classify customer service emails with Bart MNLI

Zero-shot learning, or ZSL, is a machine learning process commonly used for Natural Language Processing that allows you to generate predictions on unseen data without the need to train a...

How to auto-generate product summaries using deep learning

Several years ago, in one of my first Ecommerce Director roles, I worked with the ex-Myprotein founder to launch sports nutrition brand GoNutrition. As a “bootstrapped” startup, we were low...

How to assess product copy using EQA models

In ecommerce, writing good product copy is both an art and a science. Not only does product copy need to be written in the correct tone and style for your...

How to use bagging, boosting, and stacking in ensembles

Ensemble models combine the predictions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several...

How to perform time series decomposition

Time series data have a reputation for being somewhat complicated, partly because they’re made up of a number of different components that work together. At the most basic level these...

How to find spelling and grammar issues on product pages

Ecommerce copywriters are busy people and don’t have the privilege of having eagle-eyed sub editors to sub-edit their copy and check it for spelling mistakes or grammatical issues, as magazine...

How to create a product matching model using XGBoost

Product matching or data matching is a computational technique employing Natural Language Processing and machine learning which aims to identify identical products being sold on different websites, where product names...

How to create a Naive Bayes product classification model

Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that...

How to use knee point detection in k means clustering

When using the k means clustering algorithm, you need to specifically define k, or the number of clusters you want the algorithm to create. Rather than selecting an arbitrary value,...

How to preprocess text for NLP in four easy steps

There’s often a lot of repetition in many data science projects. In tasks that utilise Natural Language Processing (or NLP), for example, you’ll always need to preprocess your text to...

How to detect sarcasm using machine learning

I love sarcasm, but unfortunately I have a shaky ability to easily detect it in the voices of others, an aptitude for misinterpreting serious comments for sarcasm and then inappropriately...

How to detect fake news with machine learning

Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s...

A quick guide to search intent classification for SEO

Search intent classification has been around for almost 20 years, but has only recently started to move into the mainstream in ecommerce and technical SEO. Here’s a quick guide to...

How to create a dataset for product matching models

Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where...

How to use SMOTE for imbalanced classification

Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...

How to use Recursive Feature Elimination in your models using RFECV

Something which often confuses non-data scientists is that too many features can be a bad thing for a model. It does sound logical that including more features and data...

How to use model selection and hyperparameter tuning

There are many techniques you can apply to improve the performance of your machine learning models, but two of the most powerful are model selection and hyperparameter tuning. As models...

How to transform categorical variables using encoders

There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...

How to save and load machine learning models using Pickle

Machine learning models often take hours or days to run, especially on large datasets with many features. If your machine goes off, you’ll lose your model and you’ll need to...
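
The save-and-reload round trip can be sketched with the standard library alone. The dictionary below is only a stand-in for a trained model (an assumption for illustration), and the file name model.pkl is arbitrary. Only unpickle files you trust, since loading a pickle can execute code.

```python
import pickle
import tempfile
from pathlib import Path

# A plain dictionary stands in for a trained model here (illustrative assumption)
model = {"coefficients": [0.4, 1.7], "intercept": -0.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"

    # Save the object to disk in binary mode
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back later without needing to retrain
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```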

How to create ecommerce sales forecasts using Prophet

Time series forecasting models are notoriously tricky to master, especially in ecommerce, where you have seasonality, the weather, marketing promotions, and holidays to consider. Not to mention pandemics.

How to create a response model to improve outbound sales

The predictive response models used to help identify customers in marketing can also be used to help outbound sales teams improve their call conversion rate by targeting the best people...

How to create a linear regression model using scikit-learn

Linear regression models are widely used in every industry. They predict a number from a range of other features based on a linear relationship between the input variables (X) and...

How to speed up the NLP text annotation process

When you’re building a Natural Language Processing model, it’s the text annotation process which is the most laborious and the most expensive for your business. While you can use tools...

How to create synthetic data sets for machine learning

While there are many open source datasets available for you to use when learning new data science techniques, sometimes you may struggle to find a data set to use to...

How to create image datasets for machine learning models

While many models are now pre-trained to identify certain objects, in most cases you will need to undertake further training. This requires the construction of image classification datasets containing a...

How to bin or bucket customer data using Pandas

Data binning, bucketing, or discrete binning, is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous...
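
A minimal sketch of binning a continuous variable with the Pandas cut() function might look like this. The order values, bin edges, and labels are all made up for the example.

```python
import pandas as pd

# Hypothetical order values for six customers
orders = pd.Series([12, 48, 75, 130, 260, 999], name="order_value")

# Bin edges and labels are arbitrary choices for this sketch;
# by default each bin includes its right edge, e.g. (0, 50]
bins = [0, 50, 200, float("inf")]
labels = ["low", "medium", "high"]

order_band = pd.cut(orders, bins=bins, labels=labels)
print(order_band.tolist())
```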

How to annotate training data for NLP models using Doccano

Whether you’re performing product attribute extraction, named entity recognition, product matching, product categorisation, review sentiment analysis, or you are sorting and prioritising customer support tickets, NLP models can be extremely...

Ecommerce and marketing data sets for machine learning

If you read research papers on machine learning, you’ll notice that many researchers use the same standard datasets so other data scientists can reproduce their work or try and improve...

How to use the BG/NBD model to predict customer purchases

You might think human behaviour would be hard to predict but, in ecommerce data science, it’s not actually as difficult as you may think to predict whether a customer will...

How to use NLP to identify what drives customer satisfaction

While some people might naively interpret it as negativity, I think one of the best ways you can improve an ecommerce business is to focus on the stuff you’re not...

How to use Category Encoders to encode categorical variables

Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re based on object or datetime data types such as text...

A quick guide to Product Attribute Extraction models

Product attributes, such as size, weight, wattage, or colour, are critical in ecommerce as they help customers find and select the right product for their needs. However, obtaining, adding, and...

A quick guide to Next-Product-To-Buy models

A Next-Product-To-Buy (or NPTB) model is designed to help retailers and marketers improve the effectiveness of cross-selling product recommendations by predicting the product each customer would be most likely to...

A quick guide to machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) in which models are created to predict an outcome by learning from patterns present in data. They can automatically improve...

A quick guide to machine learning uplift models

Uplift modeling is a machine learning technique used in marketing and ecommerce to predict which customers are likely to respond to a particular marketing campaign. However, rather than simply predicting...

A quick guide to Learning to Rank models

On-site search in ecommerce has improved massively in recent years, thanks to search systems such as Lucene, Solr, Algolia, and Elastic. Despite on-site search generating massive amounts of revenue for...

How to use Natural Language Understanding models

Hugging Face Transformers are a collection of State-of-the-Art (SOTA) natural language processing models produced by the Hugging Face group. Basically, Hugging Face take the latest models covered in current natural...

How to tune model hyper-parameters with grid search

Although scikit-learn’s machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little tweaking. Each estimator class accepts arguments called...

How to test your Keras, CUDA, cuDNN, and TensorFlow install

Despite having been a Linux user for about 20 years, there are times when I find I have wasted days trying to solve a seemingly simple problem. One such issue...

How to perform facial recognition in Python

Facial recognition algorithms have made giant steps in the past decade and have become commonplace in everything from social networks and mobile phone camera software, to surveillance systems. They make...

How to separate audio source data using Spleeter

Have you ever wanted to remove the singing from a track, so you can create an instrumental version to sing Karaoke to? Or do you want to remix a track...

How to build the 'Hotdog, not Hotdog' image classifier

Convolutional Neural Networks or CNNs are one of the most widely used AI techniques for detecting complex features in data. They’re particularly good for image recognition, and are used in...

How to create a neural network for sentiment analysis

Sentiment analysis, or opinion mining, is a form of emotion AI and uses natural language processing and computational linguistics to analyse text and infer the sentiment. Sentiment analysis has loads...

How to use your GPU to accelerate XGBoost models

If you’re not fortunate enough to have a really powerful data science workstation for your work, one of the problems you’ll likely face is that your models can take quite...

How to use mean encoding in your machine learning models

When you’re building a machine learning model, the feature engineering step is often the most important. From your initial small batch of features, the clever use of maths and stats...

How to interpret the confusion matrix

As a practical demonstration of how the confusion matrix works, let’s load up the Wisconsin Breast Cancer dataset, create a classification model and examine the confusion matrix to see how...
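
To show what the four cells of a binary confusion matrix count, here is a small sketch in plain Python. The labels are invented for illustration and are not taken from the Wisconsin Breast Cancer dataset. The printed layout follows the rows-are-actual, columns-are-predicted convention.

```python
from collections import Counter

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) pair
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # true positives
tn = counts[(0, 0)]  # true negatives
fp = counts[(0, 1)]  # false positives: predicted 1, actually 0
fn = counts[(1, 0)]  # false negatives: predicted 0, actually 1

# Rows are actual classes (0 then 1), columns are predicted classes (0 then 1)
matrix = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(y_true)

print(matrix)
print(accuracy)
```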

How to impute missing numeric values in your dataset

As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data...