Data comes in many forms, from integers and floats to strings, dates, and timedeltas. These different types of data are known as data types, or dtypes in Pandas, and using the right ones for your Pandas columns makes for more trouble-free Python programming.
When creating a Pandas dataframe, or importing existing data into Pandas, you don’t need to explicitly define the Pandas dtypes you want to use. Instead, Pandas will infer these from the data held in each column. Sometimes it gets it right; sometimes it needs a helping hand to change or “cast” the data to a more suitable dtype.
There are various reasons why using the correct dtype matters in Pandas. For example, if you’re working with dates and want to convert 2022-08-25 to “August 25th, 2022”, you won’t be able to do this if the data has been stored as an object, because all of Pandas’ date and time functionality expects you to use a datetime64 data type.
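To make the date case concrete, here is a minimal sketch. The one-row `dates` Series is a hypothetical stand-in for a text column, and the format string is an assumption about the desired output. Note that `strftime()` has no code for ordinal suffixes such as "25th", so this produces "August 25, 2022" under an English locale.

```python
import pandas as pd

# Hypothetical column holding the date as plain text
dates = pd.Series(["2022-08-25"])

# Text columns have no .dt accessor, so convert to a datetime dtype first
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# %B = full month name, %d = zero-padded day, %Y = four-digit year
formatted = parsed.dt.strftime("%B %d, %Y")
print(formatted.iloc[0])
```

If the ordinal suffix matters, it would need to be added with a small helper of your own after formatting.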
Similarly, if you’re attempting to calculate the sum of two values, such as “89,000” and “1000”, and those values are stored as object dtypes instead of int64, you’ll end up with “89,0001000” instead of “90000”, because adding text joins it end to end. Carefully checking that your Pandas data is stored in the correct dtype, and casting it where it isn’t, is therefore an important step.
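The following sketch reproduces that behaviour with the two values from the example. The `values` Series and the comma-stripping step are illustrative assumptions about how such data might be cleaned before casting.

```python
import pandas as pd

# The two values from the example, stored as text
values = pd.Series(["89,000", "1000"])

# Adding text values joins them end to end
as_text = values.iloc[0] + values.iloc[1]
print(as_text)  # 89,0001000

# Remove the thousands separator, then cast to integers
as_numbers = values.str.replace(",", "", regex=False).astype("int64")
total = as_numbers.sum()
print(total)  # 90000
```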
Dtypes also make a big difference to memory usage. Choosing a more compact dtype, such as category for a column with many repeated values, can substantially reduce the memory a dataframe occupies.
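One rough way to see the effect on memory is to compare `memory_usage(deep=True)` before and after a cast. The `brands` Series below is made-up data, and the exact byte counts will vary with your Pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: four brand names repeated many times
brands = pd.Series(["Lotus", "Alpine", "Porsche", "Lada"] * 25_000)

# deep=True counts the memory held by the strings themselves
as_text_bytes = brands.memory_usage(deep=True)
as_category_bytes = brands.astype("category").memory_usage(deep=True)

print(f"text: {as_text_bytes:,} bytes")
print(f"category: {as_category_bytes:,} bytes")
```

The saving comes from the category dtype storing each distinct value once and keeping only small integer codes per row, so it helps most when a column has few unique values.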
The Pandas dtype is essentially an internal construct that defines how the data are stored and how they can be used in various functions. Many Pandas functions check the dtype of the data before running, and may throw an error if they encounter the wrong data type.
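As an illustration of such a check, the `.dt` accessor refuses to work on a column that is not datetime-like. The `dates_as_text` Series is a hypothetical example, and the wording of the error message may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical column of dates stored as text
dates_as_text = pd.Series(["2022-08-21", "2022-08-22"])

raised = False
try:
    dates_as_text.dt.year
except AttributeError as error:
    # Pandas rejects the .dt accessor on non-datetime data
    raised = True
    print(f"Rejected: {error}")

# After conversion the same accessor works
years = pd.to_datetime(dates_as_text).dt.year
print(years.tolist())
```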
Confusingly, there are some subtle differences between the terminology Pandas uses for data types and the terminology used by Python and NumPy. Here’s a quick summary of the Pandas dtypes and the types of data that should be stored within them. In the tutorial below, I’ll show you how to identify the Pandas dtypes used by your Pandas columns and how you can cast them to other data types using astype().
Pandas dtype | Description |
---|---|
object | The Pandas object data type is used for storing strings or mixed numeric and non-numeric data. It corresponds to the str type in Python, or to a mix of Python types. |
int64 | The Pandas int64 data type is used for storing integers or whole numbers. It's the equivalent of the int type in Python. |
float64 | The Pandas float64 data type is used for storing decimals or floating point numbers. It's the equivalent of the float type in Python. |
bool | The Pandas bool data type is used for Boolean or True or False values. It's the equivalent of the bool type in Python. |
datetime64[ns] | The Pandas datetime64[ns] data type is used for date and time values. |
timedelta64[ns] | The Pandas timedelta64[ns] data type is used for storing time deltas, that is, the difference between two datetime values. |
category | The Pandas category data type is used for columns that hold a limited set of repeated values, typically text labels. Pandas never infers it automatically, so you will only see it where a column has been cast explicitly. |
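A quick way to check the table against your own installation is to build a small Series of each kind and print its dtype. The variable names below are illustrative, and the values are borrowed from the sample dataset. The exact dtype shown for dates and time deltas (for example the `[ns]` unit) can vary by Pandas version, so the sketch prints the dtypes rather than asserting them.

```python
import pandas as pd

whole_numbers = pd.Series([410, 330])
decimals = pd.Series([3.12, 4.24])
flags = pd.Series([True, False])
dates = pd.to_datetime(pd.Series(["2022-08-21", "2022-08-22"]))

# Subtracting a timestamp from datetimes yields time deltas
gaps = dates - pd.Timestamp("2020-01-01")

for name, series in [
    ("whole_numbers", whole_numbers),
    ("decimals", decimals),
    ("flags", flags),
    ("dates", dates),
    ("gaps", gaps),
]:
    print(f"{name}: {series.dtype}")
```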
To get started, open a Jupyter notebook, import the Pandas package, and load up a dataset. I’ve created a dataset for you to use that includes a column containing data for each Pandas data type. When this dataset was exported to CSV, I’d set the correct data type for each column. However, Pandas will only infer some of these data types correctly.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/dtypes.csv')
df
| | objects | int64s | float64s | bools | categories | datetime64s | current_date | timedeltas |
|---|---|---|---|---|---|---|---|---|
| 0 | Engine | 410 | 3.12 | True | Lotus | 2022-08-21 | 2020-01-01 | 963 days |
| 1 | Gearbox | 330 | 4.24 | False | Alpine | 2022-08-22 | 2020-01-01 | 964 days |
| 2 | 1 | 572 | 4.12 | True | Porsche | 2022-08-23 | 2020-01-01 | 965 days |
| 3 | 34.45 | 345 | 3.89 | False | Lada | 2022-08-23 | 2020-01-01 | 965 days |
To determine how well Pandas has inferred the data types in our sample dataframe we can use the info() method. Simply append .info() to the name of your dataframe and Pandas will return a table showing you the column names and their dtypes.
As you can see from our dataframe, some of the columns have the correct dtypes, but others need to have their dtypes modified. For example, the categories data is currently stored as an object dtype, the datetime64s and current_date columns are stored as object dtypes instead of datetime64, and the timedeltas column is stored as an object instead of timedelta64.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objects 4 non-null object
1 int64s 4 non-null int64
2 float64s 4 non-null float64
3 bools 4 non-null bool
4 categories 4 non-null object
5 datetime64s 4 non-null object
6 current_date 4 non-null object
7 timedeltas 4 non-null object
dtypes: bool(1), float64(1), int64(1), object(5)
memory usage: 356.0+ bytes
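If you only need the dtypes, without the counts and memory summary, the `dtypes` attribute and `select_dtypes()` are handy alternatives to `info()`. This sketch uses a small hypothetical frame, `df_sample`, so it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for part of the sample dataset
df_sample = pd.DataFrame({
    "int64s": [410, 330],
    "float64s": [3.12, 4.24],
    "bools": [True, False],
})

# One dtype per column, returned as a Series
print(df_sample.dtypes)

# Keep only the numeric columns (booleans are not counted as numbers)
numeric_columns = df_sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```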
Since Pandas has inferred the wrong data type for some of our dataframe columns, we need to convert those dtypes to the correct form. To convert the Pandas dtype of our column data we can use the astype() method.
The astype() method converts or “casts” data stored in one data type to be stored in another. For example, a number stored as an object can be cast to an int64 dtype. To use astype(), you simply append it to the Pandas column with the dtype as an argument and save the data back to the column.
df['categories'] = df['categories'].astype('category')
# Include the unit: recent Pandas versions reject a bare 'datetime64'
df['datetime64s'] = df['datetime64s'].astype('datetime64[ns]')
df['current_date'] = df['current_date'].astype('datetime64[ns]')
If you re-run .info(), you’ll be able to check that the casting process has worked correctly. You can now see that the categories column has been cast to the category data type, and the datetime64s and current_date columns have been cast to the datetime64 data type.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objects 4 non-null object
1 int64s 4 non-null int64
2 float64s 4 non-null float64
3 bools 4 non-null bool
4 categories 4 non-null category
5 datetime64s 4 non-null datetime64[ns]
6 current_date 4 non-null datetime64[ns]
7 timedeltas 4 non-null object
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(2)
memory usage: 520.0+ bytes
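As a variation on the column-by-column casts, astype() also accepts a dictionary mapping column names to dtypes, which converts several columns in one call. The sketch below uses a hypothetical `df_sample` frame rather than the downloaded dataset, and the choice of columns is an assumption for illustration.

```python
import pandas as pd

# Hypothetical frame in which every column starts out as text
df_sample = pd.DataFrame({
    "categories": ["Lotus", "Alpine"],
    "int64s": ["410", "330"],
    "datetime64s": ["2022-08-21", "2022-08-22"],
})

# Map each column name to the dtype it should be cast to
df_sample = df_sample.astype({
    "categories": "category",
    "int64s": "int64",
    "datetime64s": "datetime64[ns]",
})
print(df_sample.dtypes)
```

Columns left out of the dictionary keep their existing dtype.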
Time deltas need slightly different handling. Rather than astype(), use the dedicated Pandas to_timedelta() function, which is designed to parse strings such as “963 days”. It accepts a whole column, so there is no need to wrap it in apply(). Finally, we’ll run this function and re-run info(), which reveals we now have a Pandas dataframe with the correct data types assigned to our columns.
df['timedeltas'] = pd.to_timedelta(df['timedeltas'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objects 4 non-null object
1 int64s 4 non-null int64
2 float64s 4 non-null float64
3 bools 4 non-null bool
4 categories 4 non-null category
5 datetime64s 4 non-null datetime64[ns]
6 current_date 4 non-null datetime64[ns]
7 timedeltas 4 non-null timedelta64[ns]
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 520.0+ bytes
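Where you control the import, some of this casting can be done while the file is being read. The following is a sketch under stated assumptions: the inline `csv_text` is a made-up, cut-down copy of the sample data, and `dtype` and `parse_dates` are standard `read_csv()` parameters. Time deltas are still converted afterwards with `pd.to_timedelta()`.

```python
import io

import pandas as pd

# Hypothetical cut-down version of the sample CSV
csv_text = """categories,datetime64s,timedeltas
Lotus,2022-08-21,963 days
Alpine,2022-08-22,964 days
"""

df_sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"categories": "category"},  # cast while reading
    parse_dates=["datetime64s"],       # parse this column as datetimes
)

# Time deltas are converted after loading
df_sample["timedeltas"] = pd.to_timedelta(df_sample["timedeltas"])
print(df_sample.dtypes)
```

Setting dtypes at import time keeps the conversion logic in one place, although casting afterwards with astype() remains the more flexible option when the right dtype is only clear after inspecting the data.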
Matt Clarke, Thursday, August 25, 2022