Pandas

139 articles tagged Pandas

How to use the Pandas truncate() function

Have you ever needed to chop the top or bottom off a Pandas dataframe, or extract a specific section from the middle? If so, there’s a Pandas function called truncate()...

How to use the Pandas filter() function

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only...

How to use Pandas shift() to create lagged variables

The Pandas shift() function is used to shift the position of a dataframe or series by a specified number of periods. It’s commonly used for the creation of so-called lagged...

How to use Pandas to_json() to export JSON data

The Pandas to_json() function is one of a number of Pandas functions that allow you to export the data stored in a dataframe into other formats, in this case JavaScript...

How to use the Pandas query() function

The Pandas query() function is an awesome tool for filtering Pandas dataframes. It takes simple string arguments on column names and uses standard Pandas operators that allow you to easily...

How to use Pandas from_records() to create a dataframe

Pandas’ versatility means that there are loads of different ways to create a dataframe. The Pandas from_dict() function is one of the most common ways to create a dataframe from...

How to calculate an exponential moving average in Pandas

Simple moving averages, or SMAs, show the average value for a numeric value over a specific number of previous periods and are very useful in time series analysis, both as...

How to use the Pandas map() function

The Pandas map() function can be used to map the values of a series to another set of values or run a custom function. It runs at the series level,...

How to use Pandas pipe() to create data pipelines

The Pandas pipe() function takes a dataframe as its input, transforms or manipulates it, and returns the transformed dataframe. It is a very useful function that can be used to...

How to use Pandas assign() to create new dataframe columns

The Pandas assign() function is used to create new columns in a dataframe, usually based on calculations. The assign() function takes the name of the new column to create along...

How to use Pandas show_versions() to view package versions

The Pandas library is under constant development and new features are added regularly. This means that code you may read about online may not work if you are running an...

How to use Pandas from_dict() to create a dataframe

The Pandas library is so versatile that it provides several ways to create a dataframe. One of the most commonly used is the from_dict() method, which allows you to create...

How to use method chaining in Pandas

Pandas method chaining, or flow programming, is a modern, but sometimes controversial way of structuring Pandas code into a structured chain or series of commands. Conceptually, Pandas chaining is a...

How to round values in Pandas using round()

When working with numeric data in Pandas you’ll often need to round numbers to the nearest whole number, round them up, round them down, or round them to two decimal...

How to transpose a Pandas dataframe using T and transpose()

When working with Pandas dataframes that contain many columns, or those containing very large amounts of content, it is often useful to display the dataframe by flipping its orientation through...

How to use Pandas to_numeric() to convert strings to numbers

Many Pandas functions require data to be stored in the correct data type, or dtype as it’s known. For example, “£32,232.92” will be recognised as an object data type because...

How to use the Pandas set_index() and reset_index() functions

While many Pandas operations don’t require or benefit from an explicitly named index on the dataframe, named indexes (or indices) can be beneficial for some tasks because a wide range...

How to use lambda functions in Pandas

Lamdba functions are small anonymous functions that don’t need to be defined with a name. If you’re creating a function to solve a specific problem in Pandas and there’s little...

How to measure and reduce Pandas memory usage

While Pandas handles large datasets rather well, it can sometimes struggle with memory in certain situations. Thankfully, there are a few things you can do to reduce the amount of...

How to calculate percentage change between columns in Pandas

When working with Pandas dataframes you’ll often need to calculate the percentage change or percentage difference between the values in two columns. There are various ways to do this in...

How to calculate Spearman's rank correlation coefficient in Pandas

Spearman’s rank correlation coefficient, sometimes called Spearman’s rho, is a nonparametric statistic used to measure rank correlation, or the statistical dependence between the rankings of two variables. It explains how...

How to do custom Named Entity Recognition in Pandas using Spacy

As I showed in my previous tutorial on named entity recognition in Spacy, the EntityRuler allows you to customise Spacy’s default NER model to allow you to create your own...

How to calculate a rolling average or rolling mean in Pandas

The Pandas rolling() method can be used to calculate a rolling mean or rolling average (also known as a moving average), which is simply the mean of a specific time...

How to reorder Pandas dataframe columns

As you add new columns to Pandas dataframes they’ll often start to get large and the columns may appear in an order that no longer makes sense. To make your...

How to split strings using the Pandas split() function

The Pandas split() function lets you split a string value up into a list or into separate dataframe columns based on a separator or delimiter value, such as a space...

How to use Pandas explode() to split a list column into rows

When dealing with more complex datasets, you’ll often find that Pandas dataframe columns contain data stored as Python lists. While these are easy to store, they do take a little...

How to use Pandas std() to calculate standard deviation

Standard deviation, STD or STDEV, is a descriptive statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance....

How to use Pandas sample() to show a sample of data

The Pandas sample() function is used to show a random sample of data from a dataframe. The sample() function is useful for quickly checking the data in a dataframe, and...

How to use Pandas concat() to concatenate dataframes

The Pandas concat() function is used to concatenate (or join together) two or more Pandas objects such as dataframes or series. It can be used to join two dataframes together...

How to get and set Pandas cell values with at[] and iat[]

The Pandas at[] and iat[] methods can be used to get and set the values of specific cells in a Pandas dataframe. The at[] method is used to get and...

How to use pop() to drop a Pandas dataframe column

While the Pandas drop() method is probably the most common way to drop columns or remove columns from a Pandas dataframe, there is another lesser known method you can also...

How to use Pandas head() and tail() to get the first and last rows

One of the first things you’ll do whenever you import a Pandas dataframe is view the data to check that it’s formatted correctly and see what you’re dealing with. It’s...

How to use append() to add rows to a Pandas dataframe

The Pandas append() function is commonly used for appending or adding new rows to the bottom of an existing Pandas dataframe, or joining or concatenating dataframes vertically. However, while still...

How to prefix or suffix Pandas column names and values

When working with Pandas dataframes it really helps to have clear and consistent naming conventions for column labels or column names, and for the column values themselves. Adding a prefix...

How to find the most common value in a Pandas dataframe column

When working with categorical data in Pandas dataframes, it can help to get an understanding of the number of times a given value appears - a feature called “cardinality.” The...

How to drop Pandas dataframe rows and columns

When working with Pandas dataframes you’ll often need to drop, remove, or delete columns or rows from a dataframe to leave you with a clean dataframe containing tidier data you...

How to calculate Pearson correlation coefficient in Pandas

The Pearson correlation coefficient, or PCC, is the standard statistical method for computing pairwise or bivariate correlation in Pandas. It’s so commonly used in statistics, that it is often referred...

How to use Pandas date_range() to create date ranges

Pandas includes some incredible features for working with dates and times. The Pandas date_range() function is used to create a range of dates and can be used to create a...

How to slugify column names and values in Pandas

Slugification is the process of removing non-alphanumeric characters from a string and replacing spaces with underscores. Slugifying data is really useful for data scientists and can be used to both...

How to identify and remove duplicate values in Pandas

Duplicate values are a common occurrence in data science, and they come in various forms. Not only will you need to be able to identify duplicate values, but you will...

How to identify and count unique values in Pandas

When working with Pandas, you’ll often need to identify and count unique values in a DataFrame. This is a common task in data science, and Pandas provides two methods to...

How to use sort_values() to sort a Pandas DataFrame

When working with Pandas dataframes you’ll commonly need to sort the data in some way. This is easy to do with the sort_values() and sort_index() methods. These two methods allow...

How to use Pandas CategoricalDtype to create custom sort orders

When working with Pandas, you’ll often need to sort a dataframe by one or more columns. While the Pandas sort_values() method makes it easy to sort categorical data in alphabetical...

How to convert a Pandas dataframe or series to a list

When working with a Pandas dataframe you’ll sometimes need to convert the dataframe or a series to a list or dictionary. There are certain operations that are easier to perform...

How to add a new column to a Pandas dataframe

Pandas is extremely versatile and includes a wide range of different methods you can use to add a new column or series to an existing dataframe. Whether you want to...

How to convert a column list of dictionaries to a Pandas dataframe

When working with Pandas dataframes, you may sometimes encounter a column that contains a list of Python dictionaries or JSON objects. While this format doesn’t take up much space and...

How to transform numeric Pandas dataframe column values

When working with Pandas dataframes you’ll often need to convert values from one format to another. For example, you might need to convert a string to a float or an...

How to calculate the difference and percentage change between rows in Pandas

When working with Pandas dataframes, it’s a very common task to calculate the difference between two rows. For example, you might want to calculate the difference in the number of...

How to use isna() to check for missing values in a Pandas dataframe

Real world data is rarely clean, and you’ll often encounter missing values when working with Pandas dataframes. Missing values can lead to errors in your code, and can cause models...

How to change Pandas dataframe settings and options

When loading data into a Pandas dataframe, you’ll often find that data is truncated, columns are replaced with an ellipsis, or that the float precision makes numbers harder to read....

How to identify and change Pandas dtypes using info() and astype()

Data comes in many forms, from integers and floats, to strings, dates, and timedeltas. These different types of data are known as data types, or in Pandas dtypes, and using...

How to find the differences between two Pandas dataframes

When working with data, one common thing you’ll be tasked with doing is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce...

24 tutorials to get you started using Pandas for data science

Pandas, the Python Data Analysis Library, is the number one tool in data science and is a great reason to start learning Python programming. Irrespective of the data science project...

How to split a Pandas column string or list into separate columns

When working in Pandas you’ll sometimes encounter data stored in a single column that would actually be better presented when split into separate columns. For example, a Pandas column might...

How to export data from Pandas dataframes

The Pandas package is one of the main reasons why so many data scientists favour Python over Microsoft Excel. Pandas is incredibly powerful and versatile and can handle a wide...

How to bin data in Pandas with cut() and qcut()

Whether you call it data binning, data bucketing, or data discretization, the technique of grouping numeric data together is an exceptionally powerful one in data science, statistics, and machine learning....

How to analyse ecommerce coupon uplift with GAPandas

In ecommerce, coupons, voucher codes, or discount codes are widely used for meeting a range of different sales objectives. They can encourage new customers to make their first purchase, encourage...

How to query the Google Analytics Data API for GA4 using Python

Google recently announced that it will be sunsetting Universal Analytics and replacing it with Google Analytics 4. The news sent shock waves through the ecommerce and marketing world, as it...

How to get a list of the dimensions and metrics in your GA4 property

Google Analytics 4 uses a completely different set of dimensions and metrics to Google Analytics 3 or Universal Analytics. In this project I’ll show you how to get back a...

How to create an ABC customer segmentation in Pandas

ABC classification is a simple technique that is commonly used in inventory management and is based on the Pareto principle or 80/20 rule. This says that 80% of consequences come...

How to rename columns in Pandas dataframes

Renaming Pandas dataframe columns is a common task for the data scientist. Neat, consistent column names make your dataframe easier to read and your code cleaner to write and maintain....

How to calculate the time difference between two dates in Pandas

Calculating the time difference between two dates in Pandas can yield useful information to aid your analysis, help you understand the data, and guide a machine learning model on making...

How to use the Pandas apply function on dataframe rows and columns

The Pandas apply() function allows you to run custom functions on the values in a Series or column of your Pandas dataframe. The Pandas apply function can be used for...

How to create descriptive statistics using the Pandas describe function

The Pandas describe() function generates descriptive statistics on the contents of a Pandas dataframe to show the central tendency, shape, distribution, and dispersion of variables. Examining descriptive statistics is the...

How to run time-based SEO tests using Python

One of the problems with search engine optimisation or SEO is that search engine algorithms are essentially black boxes. They analyse so many on-page and off-page factors, and use multiple...

How to identify SEO keyword opportunities with Python

One of the most useful Python SEO projects you can undertake is to identify the top keywords for which each of your site’s pages are ranking for. Sometimes, these keywords...

How to add days and subtract days from dates in Pandas

If you regularly work with time series data, one common thing you’ll need to do is add and subtract days from a date. If you tried doing this by hand,...

How to identify striking distance keywords with Python

Striking distance keywords are those which appear just off the bottom of the first page of search engine results. Keywords that appear on the first page have the greatest visibility...

How to create monthly Google Search Console API reports with EcommerceTools

Google Search Console is a really useful tool for search marketers since it shows what is happening data-wise before organic search visitors reach your website. Google Analytics only shows you...

How to use the Mailchimp Marketing Python API with Pandas

In ecommerce, email marketing remains one of the most effective (and cost-effective) digital marketing techniques, especially when combined with data science techniques. The vast amounts of customer data generated in...

How to use the eBay Finding API with Python

The eBay Finding API gives you direct access to eBay search listings using a simple SDK. This API lets you search or query eBay to fetch specific search listings for...

How to export Zendesk tickets into Pandas using Zenpy

The Zendesk customer service platform is widely used by ecommerce businesses, but its functionality for analysing ticket trends and automatically classifying them is somewhat limited. In many cases, you might...

How to query the Google Search Console API with EcommerceTools

The Google Search Console (GSC) API is a great source of information for those working in SEO, marketing, or ecommerce. It can tell you which of your pages are appearing...

How to read Google Sheets data in Pandas with GSpread

GSpread is a Python package that makes it quick and easy to read and write data from Google Sheets spreadsheets stored in your Google Drive into Python. With a tiny...

How to calculate the Lin Rodnitzky Ratio using GAPandas

The Lin Rodnitzky Ratio is a calculation designed to help search engine marketers assess the management of paid search campaigns and account structure. When managing paid search advertising accounts you...

How to analyse product replenishment

Subscription commerce was all the rage for a while, but it’s not really become as popular as many in ecommerce perhaps envisaged. While we may have subscriptions for certain things,...

Data science courses for budding data scientists and data engineers

If you want to change careers and move into the data science or data mining field, as either a data scientist or a data engineer, or simply improve your skills,...

How to read an RSS feed in Python

RSS feeds have been a mainstay on the web for over 20 years now. These XML-based documents are generated by web servers and designed to be read in RSS feed...

19 Python SEO projects that will improve your site

Although I have never really considered myself a technical SEO, I do need to do quite a bit of SEO work as part of my role as an Ecommerce Director....

How to identify internal and external links using Python

Internal linking helps improve the user experience by recommending related content to users, which both reduces bounce rate, and helps search engine optimisation efforts. While there are no hard and...

How to scrape Google results in three lines of Python code

EcommerceTools makes it really quick and easy to scrape Google search engine results in Python. In this simple project, we’ll use EcommerceTools to search Google for your chosen keywords, use...

How to create a simple product recommender system in Pandas

Product recommender systems, or recommendation systems, as they’re also known are ubiquitous on e-commerce websites these days. They’re relatively simple to create and even fairly basic ones can give striking...

How to create PDF reports in Python using Pandas and Gilfoyle

While reporting is often quite a useful way to stay on top of your data, it’s also something you can automate to save time, even if your reports include custom...

How to create monthly Google Analytics reports in Pandas

Like most people who work in ecommerce and marketing, I spend a lot of time in Google Analytics. It’s a great tool, but when reporting on the numbers, it helps...

How to segment customers using RFM and ABC

While the Recency, Frequency, Monetary value or RFM model for customer segmentation might be old, it’s based on sound science, so no matter what customer model you’re building, it’s generally...

How to perform a customer cohort analysis in Pandas

Cohort analysis is unlike most other customer segmentation techniques in that it typically uses a time-based element. It’s typically used to segment customers into groups, or cohorts, based on their...

How to calculate CLV using BG/NBD and Gamma-Gamma

Calculating Customer Lifetime Value or CLV is considered a really important thing in marketing and ecommerce, yet most companies can’t do it properly. This clever metric tells you the predicted...

How to assign RFM scores with quantile-based discretization

RFM segmentation is one of the oldest and most effective ways to segment customers. RFM models are based on three simple values - recency, frequency, and monetary value - which...

How to engineer customer purchase latency features

Purchase latency or customer latency is a measure of the number of days between a customer’s orders and is one of the most powerful features in many propensity and churn...

How to create targeted B2B company sector datasets

As I explained in my previous post, many B2B ecommerce businesses spend huge amounts on procuring third-party data for companies they wish to target. However, with some data science skills...

How to unserialize serialized PHP arrays using Python

If you regularly work with ecommerce data, you’re likely to have encountered PHP serialized arrays or objects. Serialization is a process used to take a complex data structure, such as...

How to identify keyword cannibalisation using Python

Keyword cannibalisation occurs when you have several pages ranking for the same phrase, effectively putting them in competition with each other for search engine rankings. Since Google generally only shows...

How to analyse Pandas dataframes using SQL with PandaSQL

If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can be a...

How to geocode and map addresses using GeoPy

In the field sales sector, one common thing you’ll want to do is identify all the potential clients you have within a particular region, so you can assign your team...

How to create paid search keywords using Pandas

Setting up keywords for new paid search accounts can be a repetitive and time-consuming process. While it’s historically been done using Excel, many digital marketers are now taking advantage of...

How to use the Pandas value_counts() function

The Pandas value_counts() function can be used to count the number of times a value occurs within a dataframe column or series, as well as calculating frequency distributions. Here’s a...

How to query MySQL and other databases using Pandas

For years, I used to spend much of my time performing Exploratory Data Analysis directly in SQL. Over time, the queries I wrote became very complicated, and it was often...

The four Python data science libraries you need to learn

There are hundreds of excellent Python data science libraries and packages that you’ll encounter when working on data science projects. However, there are four of them that you’ll probably use...

How to visualise data using line charts in Seaborn

Line charts, line graphs, or line plots are among the most widely used data visualisations. They’re ideal for time series data in which you’re plot a metric on the y...

How to visualise data using barplots in Seaborn

Barplots or bar charts are probably the most widely used visualisation for displaying and comparing categorical variables. They’re very easy to understand and are quick and easy to generate.

How to visualise correlations using Pandas and Seaborn

Pearson’s product-moment correlation, or Pearson’s r, is a statistical method commonly used in data science to measure the strength of the linear relationship between variables. If you can identify existing...

How to visualise categorical data in Seaborn

Categorical data can be visualised in many ways, and there’s no requirement to stick to the standard bar chart. Here are a selection of attractive Seaborn charts, graphs, and plots...

How to create a dataset for product matching models

Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where...

How to visualise analytics data using heatmaps in Seaborn

Heatmaps are one of the most intuitive ways to display data across two dimensions, and they work particularly well on temporal data, such as web analytics metrics. They’re a great...

How to visualise RFM data using treemaps

Recent papers on the Recency, Frequency, Monetary or RFM model, such as the one by Inanc Kabasakal in 2020, have started to adopt text-based labels to help people understand the...

How to visualise data using scatterplots in Seaborn

Scatterplots, scatter graphs, scatter charts, or scattergrams, are one of the most popular mathematical plots and represent one of the best ways to visualise the relationship of data on two...

How to visualise data using histograms in Pandas

During the Exploratory Data Analysis or EDA stage one of the key things you’ll want to do is understand the statistical distribution of your data. Histograms are one of the...

How to visualise data using boxplots in Seaborn

The boxplot, or box-and-whisker diagram, is one of the most useful ways to visualise statistical distributions in data. While they can seem a bit unintuitive when you first look at...

How to use SMOTE for imbalanced classification

Imbalanced classification problems, such as the detection of fraudulent card payments, represent a significant challenge for machine learning models. When the target class, such as fraudulent transactions, makes up such...

How to use Recursive Feature Elimination in your models using RFECV

Something which often confuses non data scientists is that too many features can be a bad thing for a model. It does sound logical that including more features and data...

How to use transform categorical variables using encoders

There are loads of different ways to convert categorical variables into numeric features so they can be used within machine learning models. While you can perform this process manually on...

How to select, filter, and subset data in Pandas dataframes

Selecting, filtering and subsetting data is probably the most common task you’ll undertake if you work with data. It allows you to extract subsets of data where row or column...

How to resample time series data in Pandas

When working with time series data, such as web analytics data or ecommerce sales, the time series format in your dataset might not be ideal for the analysis you’re performing...

How to reformat dates in Pandas

If you regularly work with time series data in Pandas it’s probable that you’ll sometimes need to convert dates or datetimes and extract additional features from them.

How to import data into Pandas dataframes

Pandas allows you to import data from a wide range of data sources directly into a dataframe. These can be static files, such as CSV, TSV, fixed width files, Microsoft...

How to group and aggregate transactional data using Pandas

Transactional item data can be used to create a number of other useful datasets to help you analyse ecommerce products and customers. From the core list of items purchased you...

How to create ecommerce sales forecasts using Prophet

Time series forecasting models are notoriously tricky to master, especially in ecommerce, where you have seasonality, the weather, marketing promotions, and holidays to consider. Not to mention pandemics.

How to create a response model to improve outbound sales

The predictive response models used to help identify customers in marketing can also be used to help outbound sales teams improve their call conversion rate by targeting the best people...

How to analyse search traffic using the Google Trends API

The things we search for online can reveal a remarkable amount about us, even when viewed in aggregate on an anonymous level. For many years, Google has made some of...

How to use identify visually similar images using hashing

Image hashing (or image fingerprinting) is a technique that is used to convert an image to an alphanumeric string. While this might sound somewhat pointless, it actually has a number...

How to create an ABC XYZ inventory classification model

As everyone who works in ecommerce will know, stock-outs on your key lines can have a massive negative impact on sales and your marketing costs. In many cases, you’ll be...

How to speed up the NLP text annotation process

When you’re building a Natural Language Processing model, it’s the text annotation process which is the most laborious and the most expensive for your business. While you can use tools...

How to import data into Google Data Studio using Python

Google Data Studio has native support for a range of platforms, but there’s no reliable means of pushing data in from Python without going via another data source. Google BigQuery...

How to import data into BigQuery using Pandas and MySQL

Google BigQuery is a “serverless” data warehouse platform stored in the Google Cloud Platform. The serverless approach means you don’t have to maintain a server yourself and Google looks after...

How to create synthetic data sets for machine learning

While there are many open source datasets available for you to use when learning new data science techniques, sometimes you may struggle to find a data set to use to...

How to create an ABC inventory classification model

ABC inventory classification has been one of the most widely used methods of stock control in operations management for decades. It’s an intentionally simple system in which products are assigned...

How to connect to MySQL via an SSH tunnel in Python

Many MySQL databases are configured to accept connections from other servers on the local network and will reject connections from remote machines. Ordinarily, you could work around this by creating...

How to bin or bucket customer data using Pandas

Data binning, bucketing, or discrete binning, is a very useful technique for both preprocessing and understanding or visualising complex data, especially during the customer segmentation process. It’s applied to continuous...

How to use Category Encoders to encode categorical variables

Most datasets you’ll encounter will probably contain categorical variables. They are often highly informative, but the downside is that they’re based on object or datetime data types such as text...

How to create an ecommerce trading calendar using Pandas

In both B2C and B2B ecommerce, special trading periods such as Christmas, Mothers’ Day, and Valentines’ Day can often greatly contribute to sales. Indeed, the introduction of Black Friday sales...

How to use the Pandas melt function to reshape wide format data

When you gain access to a new dataset, chances are, it’s probably not in the format you require for analysis or modeling. The most common problem you’ll encounter is datasets...

How to scrape competitor technology data in Python

In ecommerce, it pays to watch what your competitors are doing, so over the past decade or so in which I’ve managed ecommerce businesses, I’ve regularly undertaken competitor analyses. They’re...

How to create a Pandas dataframe

The massive versatility of Pandas means that you can create dataframes from almost any type of raw data. Whether you have a list, a list of lists, a dictionary, a...

How to create a collaborative filtering recommender system

Recommender systems, or recommendation engines as they’re also known, are everywhere these days. Whether you’re looking for books on Amazon, tracks on Spotify, movies on Netflix or a date on...

How to use GAPandas to view your Google Analytics data

Over the past decade I’ve written more Google Analytics API queries than I can remember. Initially, I favoured PHP for these (and still do for permanent web-based applications utilising GA...

How to use scikit-learn datasets in data science projects

The scikit-learn package comes with a range of small built-in toy datasets that are ideal for using in test projects and applications. As they’re part of the scikit-learn package, you...

How to use mean encoding in your machine learning models

When you’re building a machine learning model, the feature engineering step is often the most important. From your initial small batch of features, the clever use of maths and stats...

How to impute missing numeric values in your dataset

As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data...

How to engineer date features using Pandas

When dealing with temporal or time series data, the dates themselves often yield information that can vastly improve the performance of your model. However, to get the best from these...