Data Science

249 articles and tutorials on data science using Python

How to use the Pandas truncate() function

Have you ever needed to chop the top or bottom off a Pandas dataframe, or extract a specific section from the middle? If so, there’s a Pandas function called truncate()...

How to use Spacy for noun phrase extraction

Noun phrase extraction is a Natural Language Processing technique that can be used to identify and extract noun phrases from text. Noun phrases are phrases that function grammatically as nouns...

How to use the Pandas filter() function

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only...

How to use Pandas shift() to create lagged variables

The Pandas shift() function is used to shift the position of a dataframe or series by a specified number of periods. It’s commonly used for the creation of so-called lagged...

How to use Pandas to_json() to export JSON data

The Pandas to_json() function is one of a number of Pandas functions that allow you to export the data stored in a dataframe into other formats, in this case JavaScript...

How to use the Pandas query() function

The Pandas query() function is an awesome tool for filtering Pandas dataframes. It takes simple string arguments on column names and uses standard Pandas operators that allow you to easily...

How to use Pandas from_records() to create a dataframe

Pandas’ versatility means that there are loads of different ways to create a dataframe. The Pandas from_dict() function is one of the most common ways to create a dataframe from...

How to calculate an exponential moving average in Pandas

Simple moving averages, or SMAs, show the average value for a numeric value over a specific number of previous periods and are very useful in time series analysis, both as...

How to use the Pandas map() function

The Pandas map() function can be used to map the values of a series to another set of values or run a custom function. It runs at the series level,...

How to use Pandas pipe() to create data pipelines

The Pandas pipe() function takes a dataframe as its input, transforms or manipulates it, and returns the transformed dataframe. It is a very useful function that can be used to...

How to use Pandas assign() to create new dataframe columns

The Pandas assign() function is used to create new columns in a dataframe, usually based on calculations. The assign() function takes the name of the new column to create along...

How to measure Python code execution times with timeit

If you’re writing Python code in a Jupyter notebook that is eventually going to be used in production, it’s sensible to consider how long it takes to run. This is...
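As a rough illustration of the idea, the standard library's timeit module can time a small function. The function below, `build_squares()`, is a hypothetical stand-in and is not taken from the tutorial.

```python
import timeit


def build_squares(n=1000):
    """Hypothetical function to benchmark: build a list of square numbers."""
    return [i * i for i in range(n)]


# Call the function 1,000 times and get the total elapsed time in seconds.
total_seconds = timeit.timeit(build_squares, number=1000)
print(f"1,000 runs took {total_seconds:.4f} seconds")

# repeat() returns one total per round; the minimum is usually the least noisy.
best_of_three = min(timeit.repeat(build_squares, number=1000, repeat=3))
print(f"Best of three rounds: {best_of_three:.4f} seconds")
```

In a Jupyter notebook, the `%timeit` magic offers a similar measurement without the extra code.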

How to use Pandas show_versions() to view package versions

The Pandas library is under constant development and new features are added regularly. This means that code you may read about online may not work if you are running an...

How to use Pandas from_dict() to create a dataframe

The Pandas library is so versatile that it provides several ways to create a dataframe. One of the most commonly used is the from_dict() method, which allows you to create...

How to use method chaining in Pandas

Pandas method chaining, or flow programming, is a modern but sometimes controversial way of organising Pandas code into a structured chain or series of commands. Conceptually, Pandas chaining is a...

How to round values in Pandas using round()

When working with numeric data in Pandas you’ll often need to round numbers to the nearest whole number, round them up, round them down, or round them to two decimal...

How to transpose a Pandas dataframe using T and transpose()

When working with Pandas dataframes that contain many columns, or those containing very large amounts of content, it is often useful to display the dataframe by flipping its orientation through...

How to use Pandas to_numeric() to convert strings to numbers

Many Pandas functions require data to be stored in the correct data type, or dtype as it’s known. For example, “£32,232.92” will be recognised as an object data type because...

How to use the Pandas set_index() and reset_index() functions

While many Pandas operations don’t require or benefit from an explicitly named index on the dataframe, named indexes (or indices) can be beneficial for some tasks because a wide range...

How to use lambda functions in Pandas

Lambda functions are small anonymous functions that don’t need to be defined with a name. If you’re creating a function to solve a specific problem in Pandas and there’s little...

How to measure and reduce Pandas memory usage

While Pandas handles large datasets rather well, it can sometimes struggle with memory in certain situations. Thankfully, there are a few things you can do to reduce the amount of...

How to calculate percentage change between columns in Pandas

When working with Pandas dataframes you’ll often need to calculate the percentage change or percentage difference between the values in two columns. There are various ways to do this in...

How to get a list of national holiday dates in Python

When working with ecommerce and marketing data in time series analysis projects, the dates of national holidays, or bank holidays, can make a big difference to customer behaviour so are...

How to use Spacy EntityRuler for custom Named Entity Recognition

Spacy’s EntityRuler component is one of several rule-based matcher components that can be used to extend the core functionality of the package. It’s really useful for the creation of custom...

How to calculate Spearman's rank correlation coefficient in Pandas

Spearman’s rank correlation coefficient, sometimes called Spearman’s rho, is a nonparametric statistic used to measure rank correlation, or the statistical dependence between the rankings of two variables. It explains how...

How to do custom Named Entity Recognition in Pandas using Spacy

As I showed in my previous tutorial on named entity recognition in Spacy, the EntityRuler allows you to customise Spacy’s default NER model to allow you to create your own...

How to calculate a rolling average or rolling mean in Pandas

The Pandas rolling() method can be used to calculate a rolling mean or rolling average (also known as a moving average), which is simply the mean of a specific time...

How to reorder Pandas dataframe columns

As you add new columns to Pandas dataframes they’ll often start to get large and the columns may appear in an order that no longer makes sense. To make your...

How to split strings using the Pandas split() function

The Pandas split() function lets you split a string value up into a list or into separate dataframe columns based on a separator or delimiter value, such as a space...

How to use Pandas explode() to split a list column into rows

When dealing with more complex datasets, you’ll often find that Pandas dataframe columns contain data stored as Python lists. While these are easy to store, they do take a little...

How to use Pandas std() to calculate standard deviation

Standard deviation, STD or STDEV, is a descriptive statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance....
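The relationship between standard deviation and variance can be checked with a small sketch. It uses Python's built-in statistics module rather than Pandas, and the numbers are made up for illustration.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration

# Population versions divide by n.
pop_variance = statistics.pvariance(values)
pop_std = statistics.pstdev(values)

# Sample versions divide by n - 1.
sample_variance = statistics.variance(values)
sample_std = statistics.stdev(values)

# In both cases the standard deviation is the square root of the variance.
print(pop_std, math.sqrt(pop_variance))
print(sample_std, math.sqrt(sample_variance))
```

Pandas `std()` uses the sample form (`ddof=1`) by default, so its result should line up with `statistics.stdev()` rather than `statistics.pstdev()`.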

How to use Pandas sample() to show a sample of data

The Pandas sample() function is used to show a random sample of data from a dataframe. The sample() function is useful for quickly checking the data in a dataframe, and...

How to use Pandas concat() to concatenate dataframes

The Pandas concat() function is used to concatenate (or join together) two or more Pandas objects such as dataframes or series. It can be used to join two dataframes together...

How to get and set Pandas cell values with at[] and iat[]

The Pandas at[] and iat[] methods can be used to get and set the values of specific cells in a Pandas dataframe. The at[] method is used to get and...

How to use pop() to drop a Pandas dataframe column

While the Pandas drop() method is probably the most common way to drop columns or remove columns from a Pandas dataframe, there is another lesser known method you can also...

How to use Pandas head() and tail() to get the first and last rows

One of the first things you’ll do whenever you import a Pandas dataframe is view the data to check that it’s formatted correctly and see what you’re dealing with. It’s...

How to use append() to add rows to a Pandas dataframe

The Pandas append() function is commonly used for appending or adding new rows to the bottom of an existing Pandas dataframe, or joining or concatenating dataframes vertically. However, while still...

How to prefix or suffix Pandas column names and values

When working with Pandas dataframes it really helps to have clear and consistent naming conventions for column labels or column names, and for the column values themselves. Adding a prefix...

How to find the most common value in a Pandas dataframe column

When working with categorical data in Pandas dataframes, it can help to get an understanding of the number of times a given value appears - a feature called “cardinality.” The...

How to drop Pandas dataframe rows and columns

When working with Pandas dataframes you’ll often need to drop, remove, or delete columns or rows from a dataframe to leave you with a clean dataframe containing tidier data you...

How to calculate Pearson correlation coefficient in Pandas

The Pearson correlation coefficient, or PCC, is the standard statistical method for computing pairwise or bivariate correlation in Pandas. It’s so commonly used in statistics, that it is often referred...

How to classify Google Search Console data in EcommerceTools

ABC analysis originally came from the field of inventory management, where it’s used by procurement staff to classify inventory items into three categories - A, B, and C - to...

How to use Pandas date_range() to create date ranges

Pandas includes some incredible features for working with dates and times. The Pandas date_range() function is used to create a range of dates and can be used to create a...

How to slugify column names and values in Pandas

Slugification is the process of removing non-alphanumeric characters from a string and replacing spaces with underscores. Slugifying data is really useful for data scientists and can be used to both...
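A minimal sketch of that process in plain Python might look like the following. The `slugify()` helper is hypothetical, and lowercasing the text is an added assumption rather than part of the definition above.

```python
import re


def slugify(text):
    """Replace spaces with underscores and drop non-alphanumeric characters."""
    text = text.strip().lower()             # lowercasing is an assumption
    text = re.sub(r"\s+", "_", text)        # runs of whitespace become one underscore
    text = re.sub(r"[^a-z0-9_]", "", text)  # remove everything else
    return text.strip("_")                  # tidy any leading or trailing underscores


column_names = ["Customer ID", "Order Value (GBP)", "Last order date"]
slugs = [slugify(name) for name in column_names]
print(slugs)
```

The same helper could be applied to the column labels of a dataframe or to the values in a single column.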

How to identify and remove duplicate values in Pandas

Duplicate values are a common occurrence in data science, and they come in various forms. Not only will you need to be able to identify duplicate values, but you will...

How to identify and count unique values in Pandas

When working with Pandas, you’ll often need to identify and count unique values in a DataFrame. This is a common task in data science, and Pandas provides two methods to...

How to use sort_values() to sort a Pandas DataFrame

When working with Pandas dataframes you’ll commonly need to sort the data in some way. This is easy to do with the sort_values() and sort_index() methods. These two methods allow...

How to use Pandas CategoricalDtype to create custom sort orders

When working with Pandas, you’ll often need to sort a dataframe by one or more columns. While the Pandas sort_values() method makes it easy to sort categorical data in alphabetical...

How to convert a Pandas dataframe or series to a list

When working with a Pandas dataframe you’ll sometimes need to convert the dataframe or a series to a list or dictionary. There are certain operations that are easier to perform...

How to add a new column to a Pandas dataframe

Pandas is extremely versatile and includes a wide range of different methods you can use to add a new column or series to an existing dataframe. Whether you want to...

How to zip files and directories with Python

The zipfile module in Python provides a way to compress files and directories into a single zip file. This is useful for reducing the size of files and directories that...
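One way this could look in practice is sketched below. The `zip_directory()` helper and the file names are hypothetical, and the example works inside a temporary directory so it leaves nothing behind.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Compress every file under source_dir into a single zip archive."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the archive is portable.
                archive.write(path, arcname=str(path.relative_to(source_dir)))
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    # Build a small, made-up directory tree to compress.
    data_dir = Path(tmp) / "data"
    (data_dir / "reports").mkdir(parents=True)
    (data_dir / "sales.csv").write_text("order_id,total\n1,9.99\n")
    (data_dir / "reports" / "summary.txt").write_text("One order.\n")

    archive_path = zip_directory(data_dir, Path(tmp) / "data.zip")

    # Read the archive back to confirm what it contains.
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(archive.namelist())

print(archived_names)
```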

How to list files and directories with Python

When working with Python you’ll often need to access files and directories on your computer. Python includes a useful os module that gives you access to your computer or server’s...
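As a small sketch, the os module can list a single directory or walk a whole tree. The file names here are invented, and a temporary directory is used so the example is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty, made-up files to list.
    os.makedirs(os.path.join(tmp, "images"))
    for name in ("a.csv", "b.csv", os.path.join("images", "logo.png")):
        open(os.path.join(tmp, name), "w").close()

    # listdir() returns the names of files and directories at the top level.
    top_level = sorted(os.listdir(tmp))

    # scandir() yields entries that know whether they are files or directories.
    csv_files = sorted(
        entry.name
        for entry in os.scandir(tmp)
        if entry.is_file() and entry.name.endswith(".csv")
    )

    # walk() visits every subdirectory, so nested files are included too.
    all_files = sorted(
        os.path.relpath(os.path.join(root, filename), tmp)
        for root, _dirs, files in os.walk(tmp)
        for filename in files
    )

print(top_level)
print(csv_files)
print(all_files)
```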

How to use a .gitignore file

The .gitignore file is a special file added to a Git repository to define the files and directories you do not wish to commit to your Git repository. This is...

How to use Spacy for POS tagging in Pandas

Spacy is one of the most popular Python packages for Natural Language Processing. Alongside the Natural Language Toolkit (NLTK), Spacy provides a huge range of functionality for a wide variety...

How to convert a column list of dictionaries to a Pandas dataframe

When working with Pandas dataframes, you may sometimes encounter a column that contains a list of Python dictionaries or JSON objects. While this format doesn’t take up much space and...

How to create a Shopify price tracker with Python

In ecommerce, it’s very common for retailers to need to monitor the prices of their competitors. Prices make a big difference to sales and if they’re set too high then...

How to create a QR code using Python

The QR code, or Quick Response Code, works by encoding data in a two-dimensional barcode. It is a type of matrix barcode, which is a two-dimensional barcode that uses a...

How to use NLTK for POS tagging in Pandas

The Natural Language Toolkit (NLTK) is a powerful Python package for performing a wide range of common NLP tasks, including Part of Speech tagging or POS tagging for short.

How to create GitLab issues using the Python GitLab API

GitLab is one of the most widely used project management tools in software development and data science. It’s comparable to Jira and similar systems in that it provides a useful...

How to transform numeric Pandas dataframe column values

When working with Pandas dataframes you’ll often need to convert values from one format to another. For example, you might need to convert a string to a float or an...

How to calculate the difference and percentage change between rows in Pandas

When working with Pandas dataframes, it’s a very common task to calculate the difference between two rows. For example, you might want to calculate the difference in the number of...

How to calculate Customer Lifetime Value heuristics

Customer Lifetime Value or CLV (also erroneously called Lifetime Customer Value or Lifetime Value) is one of the most misunderstood of all marketing metrics. Weirdly, everyone in marketing understands its...

How to use isna() to check for missing values in a Pandas dataframe

Real world data is rarely clean, and you’ll often encounter missing values when working with Pandas dataframes. Missing values can lead to errors in your code, and can cause models...

How to resize images in Python using Pillow

Pillow is a fork of the Python Imaging Library (PIL) and is one of the most useful Python tools for resizing images. Pillow can actually perform a very wide range...

How to transcode a YouTube video to MP3 in Python

If you regularly watch YouTube videos that you want to listen to on your drive to work, you could consider downloading them and converting them to MP3. This is done...

How to change Pandas dataframe settings and options

When loading data into a Pandas dataframe, you’ll often find that data is truncated, columns are replaced with an ellipsis, or that the float precision makes numbers harder to read....

How to identify and change Pandas dtypes using info() and astype()

Data comes in many forms, from integers and floats, to strings, dates, and timedeltas. These different types of data are known as data types, or in Pandas dtypes, and using...

How to find the differences between two Pandas dataframes

When working with data, one common thing you’ll be tasked with doing is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce...

24 tutorials to get you started using Pandas for data science

Pandas, the Python Data Analysis Library, is the number one tool in data science and is a great reason to start learning Python programming. Irrespective of the data science project...

How to scrape a Shopify site in Python via products.json

Since many modern websites use JavaScript and JSON to build their pages, you can sometimes find public facing APIs buried in the page code that give you access to structured...

How to split a Pandas column string or list into separate columns

When working in Pandas you’ll sometimes encounter data stored in a single column that would actually be better presented when split into separate columns. For example, a Pandas column might...

How to export data from Pandas dataframes

The Pandas package is one of the main reasons why so many data scientists favour Python over Microsoft Excel. Pandas is incredibly powerful and versatile and can handle a wide...

How to bin data in Pandas with cut() and qcut()

Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning....

How to calculate the profitability of BOGOF and multibuy promotions

Buy One Get One Free or BOGOF promotions, and similar multibuy promotions that provide a free item when customers purchase over a specified amount are very common in retailing, including...

How to analyse ecommerce coupon uplift with GAPandas

In ecommerce, coupons, voucher codes, or discount codes are widely used for meeting a range of different sales objectives. They can encourage new customers to make their first purchase, encourage...

How to use CSS and XPath custom extraction in Advertools

The Advertools web scraping package popular in the Python SEO community automatically extracts a wide range of page elements, such as the title, meta description, and various schema.org and OpenGraph...

How to scrape a website using Advertools

For larger web scraping projects, the Scrapy web scraping Python package is one of the most effective tools. It’s powerful and fast and has a huge range of features. However,...

How to query the Google Analytics Data API for GA4 using Python

Google recently announced that it will be sunsetting Universal Analytics and replacing it with Google Analytics 4. The news sent shock waves through the ecommerce and marketing world, as it...

How to get a list of the dimensions and metrics in your GA4 property

Google Analytics 4 uses a completely different set of dimensions and metrics to Google Analytics 3 or Universal Analytics. In this project I’ll show you how to get back a...

How to create an ABC customer segmentation in Pandas

ABC classification is a simple technique that is commonly used in inventory management and is based on the Pareto principle or 80/20 rule. This says that 80% of consequences come...

How to rename columns in Pandas dataframes

Renaming Pandas dataframe columns is a common task for the data scientist. Neat, consistent column names make your dataframe easier to read and your code cleaner to write and maintain....

How to visualise internal linking in Python using NetworkX graphs

Adding internal links to articles helps reduce bounce rate by promoting related content site visitors may find interesting, but it also has a powerful impact upon search engine optimisation or...

How to calculate the ecommerce KPIs you need to hit your revenue target

In ecommerce, you’ll typically be given a revenue target your site needs to hit every month. In my experience, these revenue targets are often proposed by finance directors, CEOs, or...

How to dedupe lists in Python with set() and intersection()

When working with Python lists you’ll often encounter times when you need to remove duplicate values present in a single list, remove duplicates found in multiple lists, or identify the...
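The three cases can be sketched with a couple of made-up lists. Sets are unordered, so the results are sorted here to make the output predictable.

```python
first = ["apple", "pear", "apple", "fig"]
second = ["fig", "kiwi", "pear", "pear"]

# Remove duplicates from a single list.
unique_first = sorted(set(first))

# dict.fromkeys() also dedupes, but keeps the original order of first appearance.
ordered_unique = list(dict.fromkeys(first))

# Values present in both lists.
in_both = sorted(set(first).intersection(second))

# Values present in the first list but not the second.
only_in_first = sorted(set(first).difference(second))

print(unique_first)
print(ordered_unique)
print(in_both)
print(only_in_first)
```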

How to use DATEDIFF() to calculate date differences in MySQL

When working with customer data or on time series data science projects, you’ll often find the need to calculate the difference between two dates in your MySQL queries. The MySQL...

How to use DATE_ADD() and DATE_SUB() to add and subtract from dates

When working with customer data in ecommerce it’s very common for data scientists to need to add and subtract from date values directly within MySQL queries. For example, you might...

How to use CASE for flow control in SQL statements

The SQL CASE statement is used for flow control, much like an if, then, else statement. If the statement finds a match on the chosen condition it will return the...

How to use DATE_FORMAT() to reformat dates in MySQL

Most SQL databases store dates in the datetime format. This is really useful because MySQL, and other SQL dialects, make it very easy to convert datetime objects to a wide...

How to use string functions in SQL statements

MySQL includes a large number of string functions and operators that you can use in SQL statements to both query data and reformat column values. In this simple example, I’ll...

How to use ORDER BY to sort an SQL result set

When you create an SQL SELECT statement, the data probably won’t be returned or sorted in the order you want, so the ORDER BY clause is used to control the...

How to use GROUP BY and HAVING in SQL statements

The SQL GROUP BY clause groups row-based data into aggregated data, reducing the number of rows in the dataset, and is commonly used to perform aggregate calculations.

How to use BETWEEN in SQL statements to return data between two values

In SQL, when you want to SELECT data that lies between two values, there are a number of different SQL operators you can use to return the correct data. However,...

How to use SELECT, FROM, WHERE, and AND in SQL statements

The SELECT statement is the most simple of all SQL queries and allows you to retrieve the precise data you want from one or more tables, or even databases. In...

How to import a MySQL database

SQL is one of the most widely used languages in data science so it’s important to know at least the basics required in order to fetch the data you’ll need...

How to read QR codes in Python using OpenCV

The QR code or Quick Response code is a two-dimensional or matrix barcode invented in the early nineties by a Japanese car manufacturer. QR codes started to become popular in...

How to calculate month start and month end dates in Python

When creating business reports or running queries against a database or web analytics platform in a business setting, you’ll often need to know the start and end dates of the...
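One possible approach uses the standard library's calendar module. The `month_bounds()` helper below is a hypothetical name used only for illustration.

```python
import calendar
from datetime import date


def month_bounds(year, month):
    """Return the first and last dates of the given month."""
    # monthrange() returns (weekday of the first day, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, days_in_month)


start, end = month_bounds(2024, 2)
print(start, end)  # 2024 is a leap year, so February ends on the 29th
```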

How to calculate ISO week numbers and start and end dates in Python

In ecommerce and marketing it’s relatively common to use ISO week numbers when reporting data. The ISO week system is a leap week calendar that forms part of the ISO...
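A short sketch with the datetime module shows both directions: getting the ISO week for a date, and getting the Monday and Sunday of an ISO week. The `iso_week_bounds()` helper is hypothetical, and `date.fromisocalendar()` needs Python 3.8 or later.

```python
from datetime import date, timedelta


def iso_week_bounds(iso_year, iso_week):
    """Return the Monday and Sunday of the given ISO week."""
    monday = date.fromisocalendar(iso_year, iso_week, 1)
    return monday, monday + timedelta(days=6)


# isocalendar() returns the ISO year, ISO week number, and ISO weekday.
iso_year, iso_week, iso_weekday = date(2021, 1, 3).isocalendar()
print(iso_year, iso_week, iso_weekday)

week_start, week_end = iso_week_bounds(2021, 1)
print(week_start, week_end)
```

Note that the ISO year can differ from the calendar year around New Year: 3 January 2021 falls in week 53 of ISO year 2020.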

How to use dictionaries in Python

Alongside the Python list, the dictionary is the most commonly used data storage structure in Python. Dictionaries allow you to store numeric and text-based data as a series of key-value...
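The basics can be sketched in a few lines. The product data below is made up for illustration.

```python
# A dictionary stores values against unique keys.
product = {"sku": "ABC123", "name": "Widget", "price": 9.99}

# Add a new key and update an existing one.
product["stock"] = 25
product["price"] = 11.49

# get() returns a default instead of raising KeyError when a key is missing.
colour = product.get("colour", "unknown")

# items() lets you loop over keys and values together.
for key, value in product.items():
    print(f"{key}: {value}")

# A dictionary comprehension builds a new dictionary from an existing one.
price_by_sku = {"ABC123": 11.49, "XYZ789": 4.50}
discounted = {sku: round(price * 0.9, 2) for sku, price in price_by_sku.items()}
print(discounted)
```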

How to analyse Average Order Value with Jenks natural breaks classification

Average Order Value or AOV is one of the most critical ecommerce metrics. Along with your sessions and conversion rate, it ultimately controls how much revenue an ecommerce business generates....

How to check if URLs are redirected using Requests

The requests HTTP library for Python allows you to make HTTP requests to servers and receive back HTTP status codes, site content, and other data. It’s extremely useful for building...

How to use operators in Google Analytics API queries

To extract specific data from the Google Analytics API you will often need to use segments and filters to ensure you get the data you want. For example, you might...

How to calculate abandonment and completion rates using the Google Analytics API

Google Analytics provides a useful Shopping Behaviour Analysis report that lets you examine the volumes of users who are performing important actions on your ecommerce website, such as viewing products,...

How to use try except for Python exception handling

Exceptions are events that can modify the flow of control through a Python application and are triggered when errors occur. When writing production code it’s a good idea to both...
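As a minimal sketch, a try except block can catch an expected error and return a safe value instead of crashing. The `safe_divide()` function is a hypothetical example, not code from the tutorial.

```python
def safe_divide(numerator, denominator):
    """Divide two numbers, returning None when the denominator is zero."""
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        # An expected problem: handle it quietly and return a sentinel value.
        return None
    except TypeError as error:
        # An unexpected input type: re-raise with a clearer message.
        raise ValueError(f"Both arguments must be numbers: {error}") from error
    else:
        # Runs only when no exception was raised.
        return result


print(safe_divide(10, 4))
print(safe_divide(1, 0))
```

Catching specific exception classes, rather than using a bare `except`, keeps unrelated bugs from being silently hidden.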

How to use the Pip Python package manager

Pip is a command line application that allows you to install, upgrade, and remove Python packages from your development environment using simple commands. It works just like the Aptitude or...

How to calculate the time difference between two dates in Pandas

Calculating the time difference between two dates in Pandas can yield useful information to aid your analysis, help you understand the data, and guide a machine learning model on making...

How to use the Dropbox API with Python

Dropbox is one of the most widely used file storage and file sharing platforms and is used by a wide variety of businesses. Dropbox has put the effort into building...

How to send emails using the Mailchimp Transactional email API

While Mailchimp is best known for its email marketing functionality, it also includes an excellent transactional email API designed to allow developers to send transactional emails from their applications.

How to read an XML feed into a Pandas dataframe

XML feeds are a data format that uses Extensible Markup Language to provide structured data that can be read by search engines and online advertising providers. For example, a Google...

How to use the Pandas apply function on dataframe rows and columns

The Pandas apply() function allows you to run custom functions on the values in a Series or column of your Pandas dataframe. The Pandas apply function can be used for...

How to create descriptive statistics using the Pandas describe function

The Pandas describe() function generates descriptive statistics on the contents of a Pandas dataframe to show the central tendency, shape, distribution, and dispersion of variables. Examining descriptive statistics is the...

16 Python web scraping projects for ecommerce and SEO

Web scraping is a programming technique that uses a script or bot to visit one or more websites and extract specific elements or HTML tags from the source code of...

A quick guide to customer retention

There are two main ways you can grow the customer base of a business: you can either acquire more customers by increasing your customer acquisition rate, or you can improve...

How to create a Google rank checker tool using Python

Most off-the-shelf SEO tools come with a rank checker that allows you to monitor your position for given phrases in the Google search engine results. If you want to create...

How to use the Feefo API for ecommerce competitor analysis

Most ecommerce websites use review platforms, such as Feefo, Trustpilot, and Google Reviews, to allow customers to give feedback on their service and the products they sell. The reviews help...

How to compare time periods using the Google Search Console API

One common task you’ll perform in Google Search Console is to compare the data from two different time periods to see how impressions, clicks, click-through rate (CTR), or average position...

A quick guide to the RFM model for data scientists

The RFM model is probably one of the best known and most widely used customer segmentation models by data driven marketers. It’s used for both measuring customer value and predicting...

How to run time-based SEO tests using Python

One of the problems with search engine optimisation or SEO is that search engine algorithms are essentially black boxes. They analyse so many on-page and off-page factors, and use multiple...

How to create content recommendations using TF IDF

After work, when I’m not learning about data science, practising data science, or writing about data science, I like to browse classic car auction sites looking for cars I can’t...

A quick guide to customer segmentation for B2B e-commerce

Customer segmentation, and the similar and related field of market segmentation, are particularly relevant to the field of business-to-business (B2B) e-commerce. B2B customers often have a higher Customer Lifetime Value...

How to detect Google Search Console anomalies

There are some great anomaly detection models available for Python, which let you examine complex data for a wide range of different anomaly types. In this project, I’ll show you...

How to identify SEO keyword opportunities with Python

One of the most useful Python SEO projects you can undertake is to identify the top keywords for which each of your site’s pages is ranking. Sometimes, these keywords...

How to add days and subtract days from dates in Pandas

If you regularly work with time series data, one common thing you’ll need to do is add and subtract days from a date. If you tried doing this by hand,...

How to analyse Google Analytics demographics and interests with GAPandas

The demographics and interests data provided in Google Analytics can be a useful way to understand who is visiting your site or purchasing your products, without the need to perform...

How to identify striking distance keywords with Python

Striking distance keywords are those which appear just off the bottom of the first page of search engine results. Keywords that appear on the first page have the greatest visibility...

A quick guide to lead scoring for B2B e-commerce sites

Lead scoring is a Customer Relationship Management (CRM) process that involves segmenting CRM contacts based on their likelihood to make a purchase. Lead scoring is applied to both existing customers...

How to trigger marketing automations using the Mailchimp API in Python

Mailchimp is one of the most widely used email service providers (or ESPs) in ecommerce and marketing. Since it is popular with those who only need its basic campaign features,...

How to create monthly Google Search Console API reports with EcommerceTools

Google Search Console is a really useful tool for search marketers since it shows what is happening data-wise before organic search visitors reach your website. Google Analytics only shows you...

How to use the Mailchimp Marketing Python API with Pandas

In ecommerce, email marketing remains one of the most effective (and cost-effective) digital marketing techniques, especially when combined with data science techniques. The vast amounts of customer data generated in...

How to use the eBay Finding API with Python

The eBay Finding API gives you direct access to eBay search listings using a simple SDK. This API lets you search or query eBay to fetch specific search listings for...

How to export Zendesk tickets into Pandas using Zenpy

The Zendesk customer service platform is widely used by ecommerce businesses, but its functionality for analysing ticket trends and automatically classifying them is somewhat limited. In many cases, you might...

How to query the Google Search Console API with EcommerceTools

The Google Search Console (GSC) API is a great source of information for those working in SEO, marketing, or ecommerce. It can tell you which of your pages are appearing...

How to read Google Sheets data in Pandas with GSpread

GSpread is a Python package that makes it quick and easy to read and write data from Google Sheets spreadsheets stored in your Google Drive into Python. With a tiny...

How to calculate the Lin Rodnitzky Ratio using GAPandas

The Lin Rodnitzky Ratio is a calculation designed to help search engine marketers assess the management of paid search campaigns and account structure. When managing paid search advertising accounts you...

How to analyse product replenishment

Subscription commerce was all the rage for a while, but it’s not really become as popular as many in ecommerce perhaps envisaged. While we may have subscriptions for certain things,...

Data science courses for budding data scientists and data engineers

If you want to change careers and move into the data science or data mining field, as either a data scientist or a data engineer, or simply improve your skills,...

A quick guide to customer segmentation for data scientists

Customer segmentation is the process of using data science techniques to create discrete groups of customers which share common characteristics or attributes. For example, a company might segment customers into...

How to read an RSS feed in Python

RSS feeds have been a mainstay on the web for over 20 years now. These XML-based documents are generated by web servers and designed to be read in RSS feed...

19 Python SEO projects that will improve your site

Although I have never really considered myself a technical SEO, I do need to do quite a bit of SEO work as part of my role as an Ecommerce Director....

How to identify internal and external links using Python

Internal linking helps improve the user experience by recommending related content to users, which both reduces bounce rate and helps search engine optimisation efforts. While there are no hard and...

How to scrape Google results in three lines of Python code

EcommerceTools makes it really quick and easy to scrape Google search engine results in Python. In this simple project, we’ll use EcommerceTools to search Google for your chosen keywords, use...

How to create a simple product recommender system in Pandas

Product recommender systems, or recommendation systems as they’re also known, are ubiquitous on e-commerce websites these days. They’re relatively simple to create and even fairly basic ones can give striking...

15 ways you can use data science to boost ecommerce performance

Major internet retailers, like Walmart and Amazon, have been at the forefront of ecommerce data science and data analytics for many years, contributing lots of interesting papers to data science...

How to create PDF reports in Python using Pandas and Gilfoyle

While reporting is often quite a useful way to stay on top of your data, it’s also something you can automate to save time, even if your reports include custom...

How to create monthly Google Analytics reports in Pandas

Like most people who work in ecommerce and marketing, I spend a lot of time in Google Analytics. It’s a great tool, but when reporting on the numbers, it helps...

How to segment your customers using EcommerceTools

Customer segmentation can give you huge insights into your business and identify a whole range of different things about your customers, allowing you to change your marketing and improve results....

How to use EcommerceTools for technical SEO

There’s often a lot of faffing around required to get marketing and ecommerce data from various systems into Pandas so you can analyse it, or use it within more complex...

How to segment customers using RFM and ABC

While the Recency, Frequency, Monetary value or RFM model for customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally...

How to perform a customer cohort analysis in Pandas

Cohort analysis is unlike most other customer segmentation techniques in that it typically uses a time-based element. It’s typically used to segment customers into groups, or cohorts, based on their...

How to machine translate product descriptions

Whether you’re analysing content written in other languages using Natural Language Processing, or you want to assist your content team by translating their writing into other languages, machine translating software...

How to identify near duplicate content using LMS

In those ecommerce businesses where relatively few products are launched and products have a relatively long lifecycle, copywriters tend to be targeted on producing unique content that sells the benefits...

How to create a product and price metadata scraper

In ecommerce, price monitoring is a really important consideration. If you offer your products at a price which is too high within the market, you may lose sales to rivals,...

How to calculate CLV using BG/NBD and Gamma-Gamma

Calculating Customer Lifetime Value or CLV is considered a really important thing in marketing and ecommerce, yet most companies can’t do it properly. This clever metric tells you the predicted...

How to assign RFM scores with quantile-based discretization

RFM segmentation is one of the oldest and most effective ways to segment customers. RFM models are based on three simple values - recency, frequency, and monetary value - which...

How to analyse product consumption and repurchase rates

You can learn many things about your products from your customers’ purchase behaviours, product consumption, and product replenishment. Some items are purchased individually, some items are purchased...

How to use Spintax to create content and ad copy in Python

If you’ve never heard of Spintax, you’ll definitely be aware of its typical usages. This text string replacement “language” is most commonly used for text spinning or article spinning, but...

How to scrape schema.org metadata using Python

As I’ve mentioned in previous posts on web scraping, the most efficient way to scrape data is to identify what Schema.org metadata is in use and then create a microdata...

How to scrape People Also Ask data using Python

People Also Ask or PAA boxes have been becoming increasingly common in Google’s search results over the past few years. They show a range of questions and answers related to...

How to scrape Google search results using Python

Although I suspect you are probably not technically allowed to do it, I doubt there’s an SEO in the land who hasn’t scraped Google search engine results to analyse them,...

How to join Google Analytics and Google Search Console data

Neither Google Search Console nor Google Analytics gives you access to the data found in both systems in one place. However, with a bit of ingenuity and some relatively simple...

How to identify SEO keywords using Google Autocomplete

The Google Autocomplete feature, or Google Suggest as it was previously known, has become a part of everyday life for us all. Start typing a search term into Google, and...

How to engineer customer purchase latency features

Purchase latency or customer latency is a measure of the number of days between a customer’s orders and is one of the most powerful features in many propensity and churn...
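
As a rough illustration of the definition above, purchase latency can be sketched as the gap in days between consecutive orders. The order dates and the `purchase_latencies` helper below are hypothetical and only show the idea.

```python
from datetime import date
from statistics import mean

# Hypothetical order dates for a single customer (illustrative data only)
order_dates = [date(2023, 1, 1), date(2023, 1, 15), date(2023, 2, 14)]


def purchase_latencies(dates):
    """Return the number of days between each consecutive pair of orders."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


latencies = purchase_latencies(order_dates)
avg_latency = mean(latencies)
print(latencies, avg_latency)
```

The mean latency is one possible feature; the standard deviation, or the days elapsed since the last order relative to the mean, could be derived in the same way.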

How to create targeted B2B company sector datasets

As I explained in my previous post, many B2B ecommerce businesses spend huge amounts on procuring third-party data for companies they wish to target. However, with some data science skills...

How to create a UK data science jobs dataset

According to the Harvard Business Review, the role of data scientist is said to be “the sexiest job of the 21st century”. Data science and data engineering skills are said...

How to create a dataset containing all UK companies

In B2B ecommerce, there are two main approaches to new customer acquisition: you either rely on your website to acquire customers for you, or you target specific customers through sales...

How to count indexed pages using Python

One quick and easy way to understand the size of a website, and its growth rate, is to examine the number of its web pages Google has indexed. You can...

How to calculate safety stock and reorder point

Although the techniques for reducing its impact have existed for decades, inventory management is still a huge issue in many businesses. Various things happen that can result in costly stock...

How to calculate operations management metrics in Python

Successful operations management is crucial to the overall growth of an ecommerce business. While those in ecommerce, marketing, or data science can work together to get the sales coming in and encourage...

How to calculate marketing metrics in Python

Marketers can be just as obsessive about data as data scientists, so there is an abundance of well-researched marketing metrics available for analysing marketing performance. Most of the commonly used...

How to calculate customer experience metrics in Python

Customers are expensive to acquire but generate more and more profit as time goes on. Providing you nurture them, treat them kindly, and apologise and fix any mistakes that occur,...

How to calculate category management metrics in Python

Category management is a retail technique that breaks down a company’s product range into groups of related items, such as categories, or subcategories, or by their product type. By running...

How to access the Google Knowledge Graph Search API

The Google Knowledge Graph database includes an astronomical amount of data on almost every topic you can think of, allowing Google to create Knowledge Panels and infoboxes that summarise search...

A quick guide to catalogue marketing data science

Catalogue marketing is dying out. Over the past few years, virtually all the UK’s top catalogue retailers have stopped printing on paper and successfully transitioned their businesses online, either to...

How to use Extruct to identify Schema.org metadata usage

The downside to building datasets using web scraping is that every site has custom HTML. If you scrape sites in this way, you’ll forever be building bespoke scrapers, and they’ll...

How to unzip files with Python

Most very large datasets tend to get compressed on servers to preserve storage space and bandwidth and allow them to be downloaded more quickly by end users. Python includes some...

How to unserialize serialized PHP arrays using Python

If you regularly work with ecommerce data, you’re likely to have encountered PHP serialized arrays or objects. Serialization is a process used to take a complex data structure, such as...

How to send data to Google Analytics in Python with PyGAMP

The Google Analytics Measurement Protocol API lets you add data to your GA account that hasn’t been triggered by a user visiting a web page. Since it’s so flexible, you...

How to scrape Open Graph protocol data using Python

Many websites include Open Graph protocol data in their document head. This structured data allows social networks, such as Facebook and Twitter, to access specific elements of the page’s content...

How to scrape and parse a robots.txt file using Python

When scraping websites, and when checking how well a site is configured for crawling, it pays to carefully check and parse the site’s robots.txt file. This file, which should be...
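
One way to check a robots.txt file is with the `urllib.robotparser` module from the Python standard library. This is a minimal sketch; the robots.txt content and the example.com URLs are made up, and a real script would fetch the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this would be downloaded from the site
robots_txt = """
User-agent: *
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler may fetch each URL
can_fetch_home = parser.can_fetch("*", "https://example.com/")
can_fetch_checkout = parser.can_fetch("*", "https://example.com/checkout/basket")
print(can_fetch_home, can_fetch_checkout)
```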

How to scrape a site's page titles and meta descriptions

Scraping the titles and meta descriptions from every page on a site can tell you a great deal about its content, the underlying content strategy, or product ranges, and many...

How to scan a site for 404 errors and 301 redirect chains

Both 404 page not found errors and 301 redirect chains can be costly and damaging to the performance of a website. They’re both easy to introduce, especially on ecommerce sites...

How to resize and compress images in Python with the TinyPNG API

Large, uncompressed images slow down your site, increase bandwidth costs, harm the user experience, and impact search engine rankings. In this project, I’ll show you how you can bulk resize...

How to parse XML sitemaps using Python

XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs. However, they’re also a useful tool in competitor analysis and allow...

How to parse URL structures using Python

URLs often contain useful information that can be used to analyse a website, a user’s search, or the breakdown of content present in each section. While they often look pretty...
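
As a small sketch of the idea, the standard library `urllib.parse` module can split a URL into its components. The URL below is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL used purely for illustration
url = "https://example.com/category/shoes?colour=red&size=9"

parts = urlparse(url)

# Split the path into its directory-like segments, dropping empty strings
path_segments = [segment for segment in parts.path.split("/") if segment]

# Parse the query string into a dictionary of lists
query = parse_qs(parts.query)

print(parts.netloc, path_segments, query)
```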

How to identify keyword cannibalisation using Python

Keyword cannibalisation occurs when you have several pages ranking for the same phrase, effectively putting them in competition with each other for search engine rankings. Since Google generally only shows...

How to download files with Python

In many data science projects you may need to download remote data, such as images, CSV files, or compressed data. Python makes it fairly straightforward to download files within your...

How to calculate Economic Order Quantity in Python

The Economic Order Quantity or EOQ represents the optimum purchase quantity for a given product, while aiming to minimise holding costs, shortage costs, and order costs. It’s most commonly calculated...
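
For illustration, here is a sketch of EOQ using the classic Wilson formula, the square root of (2 × annual demand × cost per order ÷ annual holding cost per unit). This basic form ignores shortage costs, and the input figures are invented.

```python
import math


def economic_order_quantity(annual_demand, order_cost, holding_cost):
    """Classic Wilson EOQ: sqrt(2 * D * S / H)."""
    return math.sqrt((2 * annual_demand * order_cost) / holding_cost)


# Hypothetical inputs: 1,200 units a year, 100 per order placed, 6 per unit held per year
eoq = economic_order_quantity(annual_demand=1200, order_cost=100, holding_cost=6)
print(eoq)
```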

How to build a web scraper using Requests-HTML

Unless you’re building a large and complex web scraper using Scrapy or Selenium, it’s probable that you’ll utilise Requests and Beautiful Soup. These two packages are brilliant for web scraping....

How to audit a site's Core Web Vitals using Python

Back in 2020, Google introduced Web Vitals, a set of metrics designed to help site owners optimise the user experience on their websites, so pages are quick to...

How to analyse Pandas dataframes using SQL with PandaSQL

If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can be a...

How to analyse non-ranking pages and search index bloat

If your site’s pages aren’t indexed by Google, you’re obviously not going to generate any traffic to them, so you’ll want to check that everything you expect to be present...

How to access the Google Search Console API using Python

Google Search Console contains loads of really useful information for technical SEO. However, there are limits to what you can do using the front-end interface, and it takes time to...

How to use Screaming Frog from the command line

The Screaming Frog SEO Spider Tool is widely used in digital marketing and ecommerce. It provides a user-friendly interface to a powerful site crawler and scraper that can be used...

How to send a Slack message in Python using webhooks

Slack is a great tool for data scientists and data engineers and is now being adopted across businesses, so it’s probable that you already use it in your workplace. Besides...

How to geocode and map addresses using GeoPy

In the field sales sector, one common thing you’ll want to do is identify all the potential clients you have within a particular region, so you can assign your team...

How to create paid search keywords using Pandas

Setting up keywords for new paid search accounts can be a repetitive and time-consuming process. While it’s historically been done using Excel, many digital marketers are now taking advantage of...

How to create a Python web scraper using Beautiful Soup

Web scraping is a really useful skill in data science. We obviously need data for our models and analyses, but it’s not always easily available, so building our own datasets...

How to write better code using DRY and Do One Thing

DRY, or Don’t Repeat Yourself, and the “Do One Thing” methodology are designed to help software engineers and data scientists create better functions. Code that isn’t written using DRY tends...

How to visualise data with quirky hand-drawn plots

Charts and plots can often look a bit stale and professional, which might not be appropriate in every setting. If you want to dumb down your charts and make them look...

How to visualise conversion funnels with Plotly

Funnels are arguably one of the most powerful data visualisations you can use within the ecommerce field. At a glance, they can show you the proportion of customers entering at...

How to use style guidelines to improve your Python code

The flexibility of programming languages like Python means that any code you write to tackle a given problem will differ in approach and style to code written by someone else....

How to use SQLite in Python

SQLite is a relational database management system (RDBMS) that is easy to access within Python and other languages. Unlike MySQL, PostgreSQL, and other databases, SQLite uses a serverless design, so...
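
Because SQLite needs no server, a quick sketch needs only the `sqlite3` module from the standard library. The example below uses an in-memory database, and the `orders` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory database needs no server and leaves no file behind
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows for illustration
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders (sku, quantity) VALUES (?, ?)",
    [("A1", 2), ("B2", 5)],
)
conn.commit()

rows = conn.execute("SELECT sku, quantity FROM orders ORDER BY id").fetchall()
conn.close()
print(rows)
```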

How to use operators in Python

For data scientists, Python operators are one of the most powerful and widely used features of this language. These special symbols or characters tell Python to perform some sort of...

How to use lists in Python

Lists are one of the most widely used data storage objects or data types within Python and are used throughout every data science package. Along with the dictionary, tuple, and...

How to use Git for your data science projects

Git is the world’s most widely used version control system and is an essential tool for data scientists, especially those collaborating on projects with others. You’ll need to be able...

How to use docstrings to improve your Python code

Docstrings are comment blocks that are added to the top of Python functions to explain the purpose of the function, describe the arguments that it accepts, and explain what the...

How to use the Pandas value_counts() function

The Pandas value_counts() function can be used to count the number of times a value occurs within a dataframe column or series, as well as calculating frequency distributions. Here’s a...

How to query MySQL and other databases using Pandas

For years, I used to spend much of my time performing Exploratory Data Analysis directly in SQL. Over time, the queries I wrote became very complicated, and it was often...

How to open, read, and write to files in Python

While data scientists may do nearly everything in Pandas, we also need to perform file operations in regular Python and in applications not tied to dataframes. Thankfully, Python makes it...

The four Python data science libraries you need to learn

There are hundreds of excellent Python data science libraries and packages that you’ll encounter when working on data science projects. However, there are four of them that you’ll probably use...

How to visualise text data using word clouds in Python

Word clouds (also known as tag clouds, wordles, or weighted lists) have been around since the mid-nineties and are one of the most effective data visualisations for representing the...

How to visualise statistical distributions with Seaborn

One of the key steps in the Exploratory Data Analysis process that comes before model development is to understand the statistical distribution of the variables or features within the data...

How to visualise data using Venn diagrams in Matplotlib

The Venn diagram is one of the most intuitive data visualisations for showing the overlap between two or three groups, or “sets”, of data. These diagrams were created in the...

How to visualise data using line charts in Seaborn

Line charts, line graphs, or line plots are among the most widely used data visualisations. They’re ideal for time series data in which you plot a metric on the y...

How to visualise data using barplots in Seaborn

Barplots or bar charts are probably the most widely used visualisation for displaying and comparing categorical variables. They’re very easy to understand and are quick and easy to generate.

How to visualise correlations using Pandas and Seaborn

Pearson’s product-moment correlation, or Pearson’s r, is a statistical method commonly used in data science to measure the strength of the linear relationship between variables. If you can identify existing...

How to visualise categorical data in Seaborn

Categorical data can be visualised in many ways, and there’s no requirement to stick to the standard bar chart. Here is a selection of attractive Seaborn charts, graphs, and plots...

How to install the NVIDIA Data Science Stack on Ubuntu 20.04

One of the most annoying aspects of working with GPU-accelerated data science software, such as NVIDIA RAPIDS, TensorFlow, PyTorch and XGBoost, is that it can sometimes be very complicated and...

How to create desktop data science apps using Nativefier

There are numerous websites I use for my work that don’t have dedicated desktop applications designed for Ubuntu Linux, such as GitHub, GitHub Gists, GitLab and Jira. However, it’s now...

How to create an Ubuntu desktop entry to run Jupyter

Despite the massive improvements to usability on Linux over the years, it remains unnecessarily complicated to create shortcut icons on GNOME.

How to build a data science workstation

If you’re working in data science, and especially if you’re working in deep learning, you’re going to need a decent workstation in order to be productive. Earlier this year I...

How to visualise analytics data using heatmaps in Seaborn

Heatmaps are one of the most intuitive ways to display data across two dimensions, and they work particularly well on temporal data, such as web analytics metrics. They’re a great...

How to visualise RFM data using treemaps

Recent papers on the Recency, Frequency, Monetary or RFM model, such as the one by Inanc Kabasakal in 2020, have started to adopt text-based labels to help people understand the...

How to visualise data using scatterplots in Seaborn

Scatterplots, scatter graphs, scatter charts, or scattergrams, are one of the most popular mathematical plots and represent one of the best ways to visualise the relationship of data on two...

How to visualise data using histograms in Pandas

During the Exploratory Data Analysis or EDA stage one of the key things you’ll want to do is understand the statistical distribution of your data. Histograms are one of the...

How to visualise data using boxplots in Seaborn

The boxplot, or box-and-whisker diagram, is one of the most useful ways to visualise statistical distributions in data. While they can seem a bit unintuitive when you first look at...

How to select, filter, and subset data in Pandas dataframes

Selecting, filtering and subsetting data is probably the most common task you’ll undertake if you work with data. It allows you to extract subsets of data where row or column...

How to resample time series data in Pandas

When working with time series data, such as web analytics data or ecommerce sales, the time series format in your dataset might not be ideal for the analysis you’re performing...

How to reformat dates in Pandas

If you regularly work with time series data in Pandas it’s probable that you’ll sometimes need to convert dates or datetimes and extract additional features from them.

How to import data into Pandas dataframes

Pandas allows you to import data from a wide range of data sources directly into a dataframe. These can be static files, such as CSV, TSV, fixed width files, Microsoft...

How to group and aggregate transactional data using Pandas

Transactional item data can be used to create a number of other useful datasets to help you analyse ecommerce products and customers. From the core list of items purchased you...

How to analyse search traffic using the Google Trends API

The things we search for online can reveal a remarkable amount about us, even when viewed in aggregate on an anonymous level. For many years, Google has made some of...

How to identify visually similar images using hashing

Image hashing (or image fingerprinting) is a technique that is used to convert an image to an alphanumeric string. While this might sound somewhat pointless, it actually has a number...

How to create an ABC XYZ inventory classification model

As everyone who works in ecommerce will know, stock-outs on your key lines can have a massive negative impact on sales and your marketing costs. In many cases, you’ll be...

How to create an ABC inventory classification model

ABC inventory classification has been one of the most widely used methods of stock control in operations management for decades. It’s an intentionally simple system in which products are assigned...

How to connect to MySQL via an SSH tunnel in Python

Many MySQL databases are configured to accept connections from other servers on the local network and will reject connections from remote machines. Ordinarily, you could work around this by creating...

How to calculate relative dates for Google Analytics queries

The Google Analytics add-on for Google Sheets allows you to use the Google Analytics reporting API to create custom weekly reports and schedule them to run. However, to run a...

How to create an ecommerce trading calendar using Pandas

In both B2C and B2B ecommerce, special trading periods such as Christmas, Mother’s Day, and Valentine’s Day can often greatly contribute to sales. Indeed, the introduction of Black Friday sales...

Dell Precision 7750 mobile data science workstation review

The Precision 7000 series is the top of the range mobile workstation laptop from Dell and is aimed firmly at professional users who are doing high-end GPU-accelerated work, whether that’s...

How to use the Pandas melt function to reshape wide format data

When you gain access to a new dataset, chances are, it’s probably not in the format you require for analysis or modeling. The most common problem you’ll encounter is datasets...

How to use the Apriori algorithm for Market Basket Analysis

Market Basket Analysis, or MBA, is a subset of affinity analysis and has been used in the retail sector for many years. It provides a computational method for identifying common...

How to scrape JSON-LD competitor reviews using Extruct

In the ecommerce sector, you can learn a lot about your competitors and the expectations of your customers by analysing the reviews their customers leave for products and service on...

How to scrape competitor technology data in Python

In ecommerce, it pays to watch what your competitors are doing, so over the past decade or so in which I’ve managed ecommerce businesses, I’ve regularly undertaken competitor analyses. They’re...

How to create a Pandas dataframe

The massive versatility of Pandas means that you can create dataframes from almost any type of raw data. Whether you have a list, a list of lists, a dictionary, a...

How to create a collaborative filtering recommender system

Recommender systems, or recommendation engines as they’re also known, are everywhere these days. Whether you’re looking for books on Amazon, tracks on Spotify, movies on Netflix or a date on...

How to use GAPandas to view your Google Analytics data

Over the past decade I’ve written more Google Analytics API queries than I can remember. Initially, I favoured PHP for these (and still do for permanent web-based applications utilising GA...

How to use scikit-learn datasets in data science projects

The scikit-learn package comes with a range of small built-in toy datasets that are ideal for using in test projects and applications. As they’re part of the scikit-learn package, you...

How to use Python regular expressions to extract information

Regular expressions are used for pattern matching in programming, allowing you to identify or extract very specific pieces of text from a string or document. They’re very powerful and extremely...
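
As a brief sketch of pattern matching with the standard library `re` module, the example below pulls an order number, a date, and an email address out of a made-up string. The patterns are deliberately simple and would need tightening for real data.

```python
import re

# Hypothetical text used purely for illustration
text = "Order 10293 shipped on 2023-04-05 to jane@example.com"

# Capture the digits that follow the word "Order"
order_id = re.search(r"Order (\d+)", text).group(1)

# Find every ISO-style date (YYYY-MM-DD)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# A deliberately loose email pattern, not a full validator
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

print(order_id, dates, emails)
```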

How to engineer date features using Pandas

When dealing with temporal or time series data, the dates themselves often yield information that can vastly improve the performance of your model. However, to get the best from these...

How to create a Python virtual environment for Jupyter

When you use pip to install Python packages from the Python Package Index (PyPI) they get stored in your site-packages directory and are used across your system whenever you run...