While Pandas handles most datasets well, it can struggle with memory in certain situations. Thankfully, there are a few things you can do to reduce the amount of memory Pandas uses, making it run faster and allowing you to work with larger datasets.
In this quick tutorial I’ll show you how to use the Pandas info()
method to measure the memory usage of a dataframe, and show you a couple of handy tricks you can use to massively reduce overall memory usage and speed up your work.
To get started, open a Jupyter notebook and import the Pandas library using the import pandas as pd
naming convention, then use read_csv()
to import a dataset. I’m importing a Google Analytics data export that includes the number of pageviews by date across a range of browsers, sources, mediums, user types, and device types. After importing the data we’ll slugify the column header names to tidy up the dataframe.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv',
parse_dates=['Date'])
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
df.head()
| | user_type | source | medium | browser | device_category | date | pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
| 1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
| 2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
| 3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
| 4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
Next, we’ll use the Pandas info()
method to determine how much memory the dataframe is using. To do this, we call df.info(memory_usage='deep')
. Pandas prints a summary showing that all the columns are of the object
data type, apart from the date
column, which is a datetime64[ns]
, and the pageviews
column, which holds int64
values. The whole dataframe is using 3.2 MB of memory.
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null object
1 source 10000 non-null object
2 medium 10000 non-null object
3 browser 10000 non-null object
4 device_category 10000 non-null object
5 date 10000 non-null datetime64[ns]
6 pageviews 10000 non-null int64
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 3.2 MB
To identify the worst offenders we can use the Pandas memory_usage()
method with the deep=True
argument. This returns the memory usage of every column, making it obvious that the date
(datetime64[ns]
) and pageviews
(int64
) columns are stored very efficiently, but the object
dtype columns are not.
df.memory_usage(deep=True)
Index 128
user_type 680000
source 634832
medium 637573
browser 634986
device_category 634882
date 80000
pageviews 80000
dtype: int64
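To make the raw byte counts easier to interpret, we can express each column's usage as a share of the total. The helper below is a hypothetical sketch, not part of the article's code; it just wraps the same memory_usage(deep=True) call shown above.

```python
import pandas as pd

# Hypothetical helper (not from the article): express each column's deep
# memory usage as a percentage of the total, largest first, so the worst
# offenders stand out at a glance.
def memory_report(df: pd.DataFrame) -> pd.Series:
    usage = df.memory_usage(deep=True)
    return (usage / usage.sum() * 100).round(1).sort_values(ascending=False)

# Small made-up example: the object column dominates the total
df = pd.DataFrame({'label': ['alpha', 'beta'] * 50, 'n': range(100)})
print(memory_report(df))
```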
First, we’ll look at the simplest way to reduce memory usage. That’s simply to reduce the number of columns by creating a subset containing only those of interest. We can do that very easily by passing in a list of the column names to df[]
and then reassigning the transformed dataframe to df_subset
.
Obviously, this only works when you don’t actually need those other columns… Re-running df_subset.info(memory_usage='deep')
shows us that we’ve reduced memory usage from 3.2 MB to 1.9 MB.
df_subset = df[['user_type', 'source', 'medium', 'pageviews']]
df_subset.head()
| | user_type | source | medium | pageviews |
|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | 3 |
| 1 | New Visitor | (direct) | (none) | 1 |
| 2 | New Visitor | (direct) | (none) | 1 |
| 3 | New Visitor | (direct) | (none) | 1 |
| 4 | New Visitor | (direct) | (none) | 1 |
df_subset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null object
1 source 10000 non-null object
2 medium 10000 non-null object
3 pageviews 10000 non-null int64
dtypes: int64(1), object(3)
memory usage: 1.9 MB
Thankfully, there is a much better way to reduce memory usage in Pandas, and it doesn’t cause any data loss. As we saw above, the original dataframe contains mostly string values stored as the object
dtype.
However, Pandas includes another data type called CategoricalDtype
that lets you store categorical data far more efficiently, provided the column's cardinality is fairly low relative to the number of rows in the dataframe.
df.head()
| | user_type | source | medium | browser | device_category | date | pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
| 1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
| 2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
| 3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
| 4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
Cardinality is simply the number of unique values within a given column. We can measure the cardinality of the columns in a Pandas dataframe in several ways: we can count the unique values using nunique()
, and we can view the unique values and their frequencies using value_counts()
.
Running df.shape
shows us that we have 10,000 rows and 7 columns, while applying nunique()
to each object
column shows us we have fairly low cardinality, with 1, 3, 4, 17, and 19 unique values per column.
df.shape
(10000, 7)
print('user_type = ', df['user_type'].nunique())
print('device_category = ', df['device_category'].nunique())
print('medium = ', df['medium'].nunique())
print('browser = ', df['browser'].nunique())
print('source = ', df['source'].nunique())
user_type = 1
device_category = 3
medium = 4
browser = 17
source = 19
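The repeated print statements above can be condensed: select_dtypes() grabs every object column at once, and value_counts() shows the values behind those counts. A minimal sketch with made-up sample data standing in for the Google Analytics export:

```python
import pandas as pd

# Made-up sample standing in for the Google Analytics export
df = pd.DataFrame({
    'user_type': ['New Visitor'] * 4,
    'medium': ['(none)', 'organic', 'referral', '(none)'],
    'pageviews': [3, 1, 1, 2],
})

# Count distinct values in every object column in one call
print(df.select_dtypes(include='object').nunique())

# value_counts() shows each unique value alongside its frequency
print(df['medium'].value_counts())
```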
Finally, we can use the Pandas assign()
method to cast the object
columns to the category
dtype using the Pandas astype()
method. We’ll do this for every column holding object
data and then save the output back to df
. Another really cool feature of category
data types is that you can even apply a custom sort order using CategoricalDtype
.
df = df.assign(
    user_type=df['user_type'].astype('category'),
    source=df['source'].astype('category'),
    medium=df['medium'].astype('category'),
    browser=df['browser'].astype('category'),
    device_category=df['device_category'].astype('category'),
)
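The custom sort order mentioned above deserves a quick sketch. With an ordered CategoricalDtype, sorting and comparisons follow the order you define rather than alphabetical order; the device ordering below is just an illustrative choice.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative ordered category: rank devices by screen size rather
# than alphabetically (the ordering itself is an arbitrary example)
device_type = CategoricalDtype(
    categories=['mobile', 'tablet', 'desktop'], ordered=True
)

s = pd.Series(['desktop', 'mobile', 'tablet']).astype(device_type)

# Sorting now respects the custom order, not alphabetical order
print(s.sort_values())
```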
If you re-run df.info(memory_usage='deep')
on the transformed dataframe you’ll see that the object
columns now appear with the category
dtype and memory usage has dropped from 3.2 MB to a mere 209.6 KB. We still have exactly the same data and nothing has been lost, but we’ve massively improved performance. The categorical columns now use less memory than the datetime64[ns]
and int64
columns!
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_type 10000 non-null category
1 source 10000 non-null category
2 medium 10000 non-null category
3 browser 10000 non-null category
4 device_category 10000 non-null category
5 date 10000 non-null datetime64[ns]
6 pageviews 10000 non-null int64
dtypes: category(5), datetime64[ns](1), int64(1)
memory usage: 209.6 KB
df.memory_usage(deep=True)
Index 128
user_type 10176
source 11904
medium 10426
browser 11733
device_category 10298
date 80000
pageviews 80000
dtype: int64
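As a final note, you can skip the astype() conversion entirely by declaring the category dtypes when loading the file, via the dtype parameter of read_csv(). A sketch using a tiny inline CSV standing in for the real export (the column names and sample rows are made up for illustration):

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real Google Analytics export
csv = io.StringIO(
    "User Type,Source,Medium,Date,Pageviews\n"
    "New Visitor,(direct),(none),2020-07-31,3\n"
    "New Visitor,google,organic,2020-07-14,1\n"
)

# dtype= tells read_csv to build category columns directly, so the
# memory-hungry object columns never exist in the first place
df = pd.read_csv(
    csv,
    dtype={'User Type': 'category', 'Source': 'category', 'Medium': 'category'},
    parse_dates=['Date'],
)
print(df.dtypes)
```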
Matt Clarke, Sunday, January 01, 2023