While Pandas handles most datasets well, it can struggle with memory in certain situations. Thankfully, there are a few things you can do to reduce the amount of memory Pandas uses, making it run faster and allowing you to work with larger datasets.
In this quick tutorial I’ll show you how to use the Pandas info()
method to measure the memory usage of a dataframe, and show you a couple of handy tricks you can use to massively reduce overall memory usage and speed up your work.
To get started, open a Jupyter notebook and import the Pandas library using the import pandas as pd
naming convention, then use read_csv()
to import a dataset. I’m importing a Google Analytics data export that includes the number of pageviews by date across a range of browsers, sources, mediums, user types, and device types. After importing the data we’ll slugify the column header names to tidy up the dataframe.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv',
parse_dates=['Date'])
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
df.head()
| | user_type | source | medium | browser | device_category | date | pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
| 1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
| 2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
| 3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
| 4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
Next, we’ll use the Pandas info()
method to determine how much memory the dataframe is using. To do this, we call df.info(memory_usage='deep')
. Pandas prints a summary showing that all the columns are of the object
data type, apart from the date
column, which is a datetime64[ns]
, and the pageviews
column, which holds int64
values. The whole dataframe is using 3.2 MB of memory.
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null object
1 source 10000 non-null object
2 medium 10000 non-null object
3 browser 10000 non-null object
4 device_category 10000 non-null object
5 date 10000 non-null datetime64[ns]
6 pageviews 10000 non-null int64
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 3.2 MB
To identify the worst offenders we can use the Pandas memory_usage()
method with the deep=True
argument. This returns the memory usage of every column, making it obvious that the date
(datetime64[ns]
) and pageviews
(int64
) columns are stored very efficiently, but the object
dtype columns are not.
df.memory_usage(deep=True)
Index 128
user_type 680000
source 634832
medium 637573
browser 634986
device_category 634882
date 80000
pageviews 80000
dtype: int64
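To make the raw byte counts easier to interpret, we can express each column's usage as a share of the total. The helper below is a hypothetical sketch, not part of the article's code; it just wraps the same memory_usage(deep=True) call shown above.

```python
import pandas as pd

# Hypothetical helper (not from the article): express each column's deep
# memory usage as a percentage of the total, largest first, so the worst
# offenders stand out at a glance.
def memory_report(df: pd.DataFrame) -> pd.Series:
    usage = df.memory_usage(deep=True)
    return (usage / usage.sum() * 100).round(1).sort_values(ascending=False)

# Small made-up example: the object column dominates the total
df = pd.DataFrame({'label': ['alpha', 'beta'] * 50, 'n': range(100)})
print(memory_report(df))
```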
First, we’ll look at the simplest way to reduce memory usage. That’s simply to reduce the number of columns by creating a subset containing only those of interest. We can do that very easily by passing in a list of the column names to df[]
and then reassigning the transformed dataframe to df_subset
.
Obviously, this only works when you don’t actually need those other columns… Re-running df_subset.info(memory_usage='deep')
shows us that we’ve reduced memory usage from 3.2 MB to 1.9 MB.
df_subset = df[['user_type', 'source', 'medium', 'pageviews']]
df_subset.head()
| | user_type | source | medium | pageviews |
|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | 3 |
| 1 | New Visitor | (direct) | (none) | 1 |
| 2 | New Visitor | (direct) | (none) | 1 |
| 3 | New Visitor | (direct) | (none) | 1 |
| 4 | New Visitor | (direct) | (none) | 1 |
df_subset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null object
1 source 10000 non-null object
2 medium 10000 non-null object
3 pageviews 10000 non-null int64
dtypes: int64(1), object(3)
memory usage: 1.9 MB
Thankfully, there is a much better way to reduce memory usage in Pandas, and it doesn’t cause any data loss. As we saw above, the original dataframe contains mostly string values stored as the object
dtype.
However, Pandas includes another data type called CategoricalDtype
that lets you store categorical data far more efficiently, provided the column's cardinality is fairly low relative to the number of rows in the dataframe.
df.head()
| | user_type | source | medium | browser | device_category | date | pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
| 1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
| 2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
| 3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
| 4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
Cardinality is simply the number of unique values within a given column. We can measure the cardinality of the columns in a Pandas dataframe in several ways: we can count the unique values using nunique()
, and we can view the unique values and their frequencies using value_counts()
.
Running df.shape
shows us that we have 10,000 rows and 7 columns, while applying nunique()
to each object
column shows us we have fairly low cardinality, with 1, 3, 4, 17, and 19 unique values per column.
df.shape
(10000, 7)
print('user_type = ', df['user_type'].nunique())
print('device_category = ', df['device_category'].nunique())
print('medium = ', df['medium'].nunique())
print('browser = ', df['browser'].nunique())
print('source = ', df['source'].nunique())
user_type = 1
device_category = 3
medium = 4
browser = 17
source = 19
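The repeated print statements above can be condensed: select_dtypes() grabs every object column at once, and value_counts() shows the values behind those counts. A minimal sketch with made-up sample data standing in for the Google Analytics export:

```python
import pandas as pd

# Made-up sample standing in for the Google Analytics export
df = pd.DataFrame({
    'user_type': ['New Visitor'] * 4,
    'medium': ['(none)', 'organic', 'referral', '(none)'],
    'pageviews': [3, 1, 1, 2],
})

# Count distinct values in every object column in one call
print(df.select_dtypes(include='object').nunique())

# value_counts() shows each unique value alongside its frequency
print(df['medium'].value_counts())
```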
Finally, we can use the Pandas assign()
method to cast the object
columns to the category
dtype using the Pandas astype()
method. We’ll do this for every column holding object
data and then save the output back to df
. Another really cool feature of category
data types is that you can even apply a custom sort order using CategoricalDtype
.
df = df.assign(
    user_type=df['user_type'].astype('category'),
    source=df['source'].astype('category'),
    medium=df['medium'].astype('category'),
    browser=df['browser'].astype('category'),
    device_category=df['device_category'].astype('category'),
)
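The custom sort order mentioned above deserves a quick sketch. With an ordered CategoricalDtype, sorting and comparisons follow the order you define rather than alphabetical order; the device ordering below is just an illustrative choice.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative ordered category: rank devices by screen size rather
# than alphabetically (the ordering itself is an arbitrary example)
device_type = CategoricalDtype(
    categories=['mobile', 'tablet', 'desktop'], ordered=True
)

s = pd.Series(['desktop', 'mobile', 'tablet']).astype(device_type)

# Sorting now respects the custom order, not alphabetical order
print(s.sort_values())
```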
If you re-run df.info(memory_usage='deep')
on the transformed dataframe you’ll see that the object
columns now appear with the category
dtype and memory usage has dropped from 3.2 MB to a mere 209.6 KB. We still have exactly the same data and nothing has been lost, but we’ve massively improved performance. The categorical columns now use less memory than the datetime64[ns]
and int64
columns!
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null category
1 source 10000 non-null category
2 medium 10000 non-null category
3 browser 10000 non-null category
4 device_category 10000 non-null category
5 date 10000 non-null datetime64[ns]
6 pageviews 10000 non-null int64
dtypes: category(5), datetime64[ns](1), int64(1)
memory usage: 209.6 KB
df.memory_usage(deep=True)
Index 128
user_type 10176
source 11904
medium 10426
browser 11733
device_category 10298
date 80000
pageviews 80000
dtype: int64
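As a final note, you can skip the astype() conversion entirely by declaring the category dtypes when loading the file, via the dtype parameter of read_csv(). A sketch using a tiny inline CSV standing in for the real export (the column names and sample rows are made up for illustration):

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real Google Analytics export
csv = io.StringIO(
    "User Type,Source,Medium,Date,Pageviews\n"
    "New Visitor,(direct),(none),2020-07-31,3\n"
    "New Visitor,google,organic,2020-07-14,1\n"
)

# dtype= tells read_csv to build category columns directly, so the
# memory-hungry object columns never exist in the first place
df = pd.read_csv(
    csv,
    dtype={'User Type': 'category', 'Source': 'category', 'Medium': 'category'},
    parse_dates=['Date'],
)
print(df.dtypes)
```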
Matt Clarke, Sunday, January 01, 2023