scikit-learn

48 articles tagged scikit-learn

How to tune a LightGBMClassifier model with Optuna

LightGBM is a gradient boosting framework that uses tree-based learning algorithms, much like the popular XGBoost model. LightGBM supports both classification and regression tasks, and is known for...

How to create a customer retention model with XGBoost

Although all businesses know the importance of retaining customers, few companies are actually able to measure customer retention accurately, and fewer still can predict which ones will churn or be...

How to add feature engineering to a scikit-learn pipeline

When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data and can often...

How to tune a CatBoostClassifier model with Optuna

CatBoost is a gradient boosting model based on decision trees, much like XGBoost, LightGBM, and other tree-based models. It is a very popular model for tabular...

How to tune an XGBRegressor model with Optuna

The XGBRegressor regression model in XGBoost is one of the most effective regression models used in machine learning. As with the other XGBoost models, XGBRegressor is a gradient boosting model...

How to create and tune an AdaBoost classification model

AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and...

How to use Optuna for XGBoost hyperparameter tuning

Over the past year or so, the Optuna package has quickly become a favourite among data scientists for hyperparameter tuning on machine learning models, and for good reason. It’s lightweight,...
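For a rough idea of the approach, here's a minimal Optuna objective for an XGBoost classifier; the dataset and search space below are placeholders rather than the article's own choices.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Hypothetical search space - adjust to suit your own data
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBClassifier(**params, eval_metric="logloss")
    # Score each trial with cross validation and let Optuna maximise the mean AUC
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```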

How to create a Naive Bayes text classification model using scikit-learn

Naive Bayes classifiers are commonly used for machine learning text classification problems, such as predicting the sentiment of a tweet, identifying the language of a piece of text, or categorising...
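A minimal sketch of the general idea, pairing a vectoriser with MultinomialNB in a pipeline; the tiny corpus here is made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up example documents and sentiment labels
docs = ["great product, love it", "terrible, broke after a day",
        "really happy with this", "awful quality, very disappointed"]
labels = ["positive", "negative", "positive", "negative"]

# Vectorise the text and fit the Naive Bayes classifier in one step
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["love the quality"]))
```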

How to use cross validation in scikit-learn machine learning models

When training a machine learning model you will split your dataset in two, with one portion of the data used to train the model, and the other portion (usually 20-30%)...
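As a quick sketch of the technique, cross_val_score re-trains and scores the model on several different folds rather than a single fixed split; the model and dataset below are just examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Five-fold cross validation gives five scores instead of one
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```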

How to create a random forest classification model using scikit-learn

The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble...

How to create a decision tree classification model using scikit-learn

The Decision Tree or DT is one of the most well known and most widely used supervised machine learning algorithms and can be applied to both regression and classification. As...

How to use CountVectorizer for n-gram analysis

CountVectorizer is a scikit-learn class that uses count vectorization to convert a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as...
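A minimal sketch of n-gram counting with CountVectorizer; the two-document corpus is a placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["data science for ecommerce", "data science for marketing"]

# ngram_range=(2, 2) counts bigrams rather than single tokens
vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
print(counts.toarray())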

How to create content recommendations using TF IDF

After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t...

How to create a contractual churn model in scikit-learn

A growing proportion of what we buy regularly is purchased via a subscription, or some other kind of contract. Most people have contracts for their internet, mobile phone, car insurance,...

How to classify customer support tickets using Naive Bayes

In ecommerce, customer service staff are often among the busiest people in the organisation, handling hundreds of tasks every day, often simultaneously. However, CS managers often get so bogged down...

How to use pipelines in your machine learning models

There’s often a great deal of repetition in machine learning projects. A typical machine learning workflow involves a number of common processes designed to clean, prepare, and transform data, so...
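A minimal sketch of the idea: a Pipeline chains preprocessing and the estimator so they're fitted and applied together; the scaler, model, and dataset here are just placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only and re-applied at predict time
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```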

How to engineer new features using Decision Tree models

One interesting technique in feature engineering is the use of Decision Trees (and other models) to create or derive new features using combinations of features from the original dataset. Here,...

Data science courses for budding data scientists and data engineers

If you want to change careers and move into the data science or data mining field, as either a data scientist or a data engineer, or simply improve your skills,...

How to create an ecommerce purchase intention model in Python

Ecommerce purchase intention models analyse click-stream consumer behaviour data from web analytics platforms to predict whether a customer will make a purchase during their visit. These online shopping models are...

How to create a classification model using XGBoost in Python

The XGBoost or Extreme Gradient Boosting algorithm is a decision tree-based machine learning algorithm which uses a process called boosting to help improve performance. Since its introduction, it's become...

How to predict employee churn using CatBoost

In the field of HR analytics, data scientists are now using employee data from their human resources department to predict employee churn. The techniques for predicting employee churn are fairly...

How to create a basic Marketing Mix Model in scikit-learn

Marketing Mix Models (MMMs) utilise multivariate linear regression to predict sales from marketing costs, and various other parameters. A Marketing Mix Model (also called a Media Mix Model), even at...

How to use the Isolation Forest model for outlier detection

Outliers, or anomalies, can impact the accuracy of both regression and classification models, so detecting and removing them is an important step in the machine learning process. On larger datasets,...
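As a rough sketch of the technique, IsolationForest flags points that are easy to isolate; the toy data and contamination value below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary points plus two obvious outliers
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8, 8], [-9, 7]]])

# fit_predict returns -1 for predicted outliers and 1 for inliers
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
print(X[labels == -1])
```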

How to use k means clustering for customer segmentation

K means is one of the most widely used algorithms for clustering data and falls into the unsupervised learning group of machine learning models. It’s ideal for many forms of...

How to assign RFM scores with quantile-based discretization

RFM segmentation is one of the oldest and most effective ways to segment customers. RFM models are based on three simple values - recency, frequency, and monetary value - which...
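A minimal sketch of quantile-based discretization with pd.qcut; the per-customer figures below are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer metrics
df = pd.DataFrame({
    "recency_days": [3, 40, 200, 15, 90, 7, 365, 28, 60, 12],
    "frequency": [12, 3, 1, 8, 2, 15, 1, 5, 4, 9],
    "monetary": [540, 120, 30, 410, 75, 980, 25, 260, 150, 630],
})

# Quantile-based scores from 1-5; recency is ranked in reverse so that
# more recent customers get the higher score
df["r"] = pd.qcut(df["recency_days"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
df["f"] = pd.qcut(df["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
df["m"] = pd.qcut(df["monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
df["rfm"] = df["r"].astype(str) + df["f"].astype(str) + df["m"].astype(str)
print(df)
```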

How to use bagging, boosting, and stacking in ensembles

Ensemble models combine the predictions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several...
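A minimal sketch of one of these approaches, stacking, where a final estimator learns how to combine the base models' predictions; the base models and dataset are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# A logistic regression meta-model stacked on top of two base models
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svc", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=3).mean())
```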

How to create a Naive Bayes product classification model

Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that...

How to use knee point detection in k means clustering

When using the k means clustering algorithm, you need to specifically define k, or the number of clusters you want the algorithm to create. Rather than selecting an arbitrary value,...
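A minimal sketch of the idea using the third-party kneed package to find the elbow of the inertia curve; the synthetic blobs and range of k values are assumptions.

```python
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for a range of candidate k values
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The knee of the convex, decreasing inertia curve suggests a value for k
knee = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing")
print(knee.knee)
```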

How to detect sarcasm using machine learning

I love sarcasm, but unfortunately I have a shaky ability to easily detect it in the voices of others, an aptitude for misinterpreting serious comments as sarcasm and then inappropriately...

How to detect fake news with machine learning

Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s...

The four Python data science libraries you need to learn

There are hundreds of excellent Python data science libraries and packages that you’ll encounter when working on data science projects. However, there are four of them that you’ll probably use...

How to use SMOTE for imbalanced classification

Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...
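A minimal sketch of SMOTE from the imbalanced-learn package, which synthesises new minority-class examples rather than simply duplicating them; the imbalanced dataset below is synthetic.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset with roughly a 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before", Counter(y))

# Oversample the minority class up to parity with the majority class
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after", Counter(y_res))
```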

How to use Recursive Feature Elimination in your models using RFECV

Something which often confuses people who aren't data scientists is that too many features can be a bad thing for a model. It does sound logical that including more features and data...
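A minimal sketch of RFECV, which recursively drops the weakest features and uses cross validation to decide how many to keep; the estimator and synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# 20 features, of which only 5 are informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X, y)

# How many features were kept, and a boolean mask of which ones
print(selector.n_features_, selector.support_)
```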

How to use model selection and hyperparameter tuning

There are many techniques you can apply to improve the performance of your machine learning models, but two of the most powerful are model selection and hyperparameter tuning. As models...

How to transform categorical variables using encoders

There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...

How to save and load machine learning models using Pickle

Machine learning models often take hours or days to run, especially on large datasets with many features. If your machine crashes or shuts down, you'll lose your model and you'll need to...
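A minimal sketch of the save-and-reload round trip with Pickle; the model and dataset are just examples.

```python
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Serialise the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later without refitting
with open("model.pkl", "rb") as f:
    reloaded = pickle.load(f)
print(reloaded.score(X, y))
```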

How to create a response model to improve outbound sales

The predictive response models used to help identify customers in marketing can also be used to help outbound sales teams improve their call conversion rate by targeting the best people...

How to create a linear regression model using scikit-learn

Linear regression models are widely used in every industry. They predict a number from a range of other features based on a linear relationship between the input variables (X) and...

How to create synthetic data sets for machine learning

While there are many open source datasets available for you to use when learning new data science techniques, sometimes you may struggle to find a data set to use to...
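A minimal sketch of generating synthetic data with scikit-learn's built-in generators; the sizes, class weights, and noise level here are arbitrary.

```python
from sklearn.datasets import make_classification, make_regression

# A synthetic binary classification set with a 9:1 class imbalance
X_clf, y_clf = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)

# A synthetic regression set with a known amount of noise
X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
print(X_clf.shape, y_clf.sum(), X_reg.shape)
```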

How to create an ABC inventory classification model

ABC inventory classification has been one of the most widely used methods of stock control in operations management for decades. It’s an intentionally simple system in which products are assigned...

How to bin or bucket customer data using Pandas

Data binning, bucketing, or discrete binning, is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous...
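A minimal sketch of binning a continuous column with pd.cut; the spend values, bin edges, and labels are hypothetical.

```python
import pandas as pd

# Hypothetical customer spend values
df = pd.DataFrame({"spend": [12, 80, 250, 640, 45, 1900, 330, 75]})

# Bucket the continuous values into labelled bands
bins = [0, 50, 500, float("inf")]
df["spend_band"] = pd.cut(df["spend"], bins=bins, labels=["low", "medium", "high"])
print(df)
```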

How to use Category Encoders to encode categorical variables

Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re based on object or datetime data types such as text...

A quick guide to machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) in which models are created to predict an outcome by learning from patterns present in data. They can automatically improve...

How to tune model hyper-parameters with grid search

Although scikit-learn's machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little tweaking. Each estimator class accepts arguments called...
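A minimal sketch of grid search with GridSearchCV, which tries every combination in the grid and scores each with cross validation; the grid, estimator, and dataset are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# A small hypothetical grid of hyperparameter values to try
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```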

How to use your GPU to accelerate XGBoost models

If you’re not fortunate enough to have a really powerful data science workstation for your work, one of the problems you’ll likely face is that your models can take quite...

How to use scikit-learn datasets in data science projects

The scikit-learn package comes with a range of small built-in toy datasets that are ideal for use in test projects and applications. As they're part of the scikit-learn package, you...
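A minimal sketch of loading one of the toy datasets as a Pandas DataFrame; the wine dataset is just one example of several loaders.

```python
from sklearn.datasets import load_wine

# as_frame=True returns the data as a Pandas DataFrame with named columns
data = load_wine(as_frame=True)
df = data.frame

print(df.head())
print(data.target_names)
```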

How to interpret the confusion matrix

As a practical demonstration of how the confusion matrix works, let's load up the Wisconsin Breast Cancer dataset, create a classification model and examine the confusion matrix to see how...
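A minimal sketch along those lines, using the same dataset; the choice of classifier is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, model.predict(X_test)))
```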

How to impute missing numeric values in your dataset

As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data...
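A minimal sketch of numeric imputation with SimpleImputer; the column and choice of median strategy are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric feature containing missing values
df = pd.DataFrame({"age": [34, np.nan, 28, 45, np.nan, 52]})

# Replace NaNs with the column median (mean, most_frequent, or a constant also work)
imputer = SimpleImputer(strategy="median")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df)
```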