Pandas, the Python Data Analysis Library, is the number one tool in data science and is a great reason to start learning Python programming. Irrespective of the data science project you’re working upon, nine times out of ten, it will use Pandas somewhere.
Although there’s a bit of a learning curve at first, Pandas quickly becomes intuitive and has many benefits over other systems, such as Microsoft Excel. If you want to succeed in data science, Pandas is the single Python library you need to master the most.
I’ve covered loads of different ways to use Pandas in previous tutorials, as well as practical ways you can apply Pandas to a wide range of data science problems in ecommerce and marketing. Work through these, and you’ll be up to speed in no time.
Pandas uses dataframes at its core. These are basically a bit like tables or spreadsheets that you can query with Pandas functions. Therefore, in any data science project your first step will typically be to create a Pandas dataframe, or import data into a Pandas dataframe.
Picture by Markus Spiske, Unsplash.
Pandas lets you create dataframes in a wide range of ways, including from a Python list, from a Python dictionary, from Python tuples, or from any combinations of the above.
If you have an existing dataset, such as spreadsheet, you can import it straight into Pandas. Pandas supports a wide range of different file types and formats, so you’ll generally be able to easily import data, not matter how it’s provided.
If you don’t have data stored in a CSV or other file you can also query it in MySQL and other databases directly using Pandas and its SQL connection functionality. If you regularly work with databases and write queries that include complex joins, this approach can be a game changer.
Once you’ve got data into your Pandas dataframe, you’ll almost certainly want to apply filters that select specific rows or columns, rename columns, or change the layout of the data. Pandas makes all these tasks relatively easy.
Picture by Pixabay, Pexels.
Selecting or filtering data is the process of applying rules to the dataframe to show only specific rows or columns based on their values, allowing you to filter your data to a subset of the original data. It’s something you’ll use day in, day out.
Renaming columns is also very useful. Data imported from various sources will often have inconsistent column naming conventions, or duplicate column names, so renaming column names can keep your code neater and easier to maintain.
Reshaping data is another common way to use Pandas. The
pivot() functions are both great for this and let you convert wide format data into long format, and long format into wide format.
Finally, once you’ve reshaped, renamed, or filtered your Pandas data, you can also export it to CSV and a wide range of other file formats. All of these are simple one line commands and are very easy to use.
Exploratory Data Analysis or EDA is one of the main uses for Pandas. It’s essentially the process of examining your data in detail to understand data types, statistical distributions, correlations, and to dig deeper to answer any questions you may have about the dataset.
describe() function is one of the most regularly used Pandas functions in EDA, and lets you create simple descriptive statistics on numeric and categorical values in the dataframe.
Pandas also lets you group and aggregate data, which can be a great way to gain a better understanding and generate new data and new variables.
Correlation analysis, usually via Pearson’s correlation, lets you identify whether any of your variables are correlated with each other, or a target variable. For example, maybe you have a churn dataset, and you want to find out which customers are most likely to churn. Correlation analysis can give you a good steer.
Pandas also lets you perform basic data visualisations to help you better understand your data. Many of these are built in, but you can also use Pandas with larger dedicated data visualisation libraries, such as Matplotlib and Seaborn.
If you’ve come from a SQL background and are finding it hard to work out how to query or filter your Pandas data during EDA, you might be pleased to read that you can even query Pandas data using SQL commands. I took the plunge and learned to do things the Pandas way, but it’s still a handy thing to be able to do if you need to.
Another powerful part of Pandas is the ability to be able to perform more complex functions on your data to calculate or create new numeric or categorical variables. These can both help you better understand or visualise your data or form part of the feature engineering step when using machine learning approaches.
Picture by Steve Johnson, Pexels.
apply() function is one you’ll want to learn. It lets you apply a function to columns or rows in a Pandas dataframe, so you can perform a calculation or assign a new value to a column.
Binning or bucketing data through discretization is also a great way to understand data or create new features for models. It basically converts continuous variables to categorical variables by placing data into groups called bins based on their values. It’s widely used in ecommerce and marketing and forms the basis of the popular RFM model.
Pandas is great for working with times, dates, and time series data and includes loads of handy features for manipulating calendrical and temporal data.
Models require numeric values and can’t use dates directly, so you’ll need to create date features that convert a date to a numeric value, such as a day number or year number.
Since dates can be represented in many different ways, you’ll also commonly need to reformat or convert dates from one representation to another. Pandas makes this fairly simple.
Resampling is another handy part of Pandas. This lets you group data by day, week, or month, so it’s easier to understand visualise. It’s great for EDA and modeling.
Pandas even understands calendars and can be used to identify holidays and business days. You can also create your own country-specific or business-specific calendars, which is great in ecommerce and marketing planning and analysis.
Unsurprisingly, with such rich functionality for manipulating times and dates, you can also calculate time differences in Pandas.
Similarly, you can also add days and subtract days from dates in Pandas to create other dates you can use in your calculations, analyses, or models.
Pandas isn’t limited to only working with local data stored in a CSV or other file. It can also read remote data and pull it into a Pandas dataframe. Here are a few ways you can achieve this.
Via the excellent GSpread package, you can use Pandas to read data from remote Google Sheets and write data back. This is a great way to get external data into your Python application, or push your data back out to share with others.
You can also read remote RSS feeds and pull the data into Pandas. The same approach works for XML, and I use this for analysing XML sitemaps, Google Shopping feeds, and a wide range of other marketing data.
If you work with Google BigQuery you can also use Pandas to connect a local MySQL database to Pandas and then push the data up to BigQuery.
Finally, one of the things I find myself doing most at work is analysing Google Analytics data, so much so that I’ve written a couple of Python packages to make this easier. GAPandas lets you query the GA3 or Universal Analytics API and pull data into Pandas, while GA4Pandas does the same for the newer GA4.
These tutorials don’t cover everything in Pandas, but they’re definitely all you need to get you up and running with the basics. Once you get to grips with the core features of Pandas, you can do almost anything with it.
Matt Clarke, Wednesday, August 24, 2022