How to identify and change Pandas dtypes using info() and astype()

Data comes in many forms, from integers and floats to strings, dates, and timedeltas. These different kinds of data are known as data types or, in Pandas, dtypes. Using the right ones for your Pandas columns makes for more trouble-free Python programming.

When creating a Pandas dataframe, or importing existing data into Pandas, you don’t need to explicitly define the Pandas dtypes you want to use. Instead, Pandas will infer these from the data held in the column. Sometimes it gets it right, sometimes it needs a helping hand to change or “cast” the data to a more suitable dtype.

Why does using the correct Pandas dtype matter?

There are various reasons why using the correct dtype matters in Pandas. For example, if you’re working with dates and want to convert 2022-08-25 to “August 25th, 2022”, you won’t be able to do this if the data has been stored as an object, because Pandas’ date and time functionality expects a datetime64 data type.
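
Here’s a minimal sketch of that point. The two-row Series and the variable names are made up for illustration. The .dt accessor only becomes available once the values are stored as datetimes. Note that strftime() has no directive for ordinal suffixes such as “25th”, so this example produces “August 25, 2022” instead.

```python
import pandas as pd

# Hypothetical example data: dates held as plain strings
dates = pd.Series(['2022-08-25', '2022-08-26'])

# The .dt accessor is not available on string data
try:
    dates.dt.strftime('%B %d, %Y')
except AttributeError as error:
    print('Cannot format strings as dates:', error)

# Convert the strings to a datetime dtype, then format them
as_dates = pd.to_datetime(dates)
formatted = as_dates.dt.strftime('%B %d, %Y')

print(as_dates.dtype)
print(formatted.tolist())  # ['August 25, 2022', 'August 26, 2022']
```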

Similarly, if you’re attempting to calculate the sum of two values, such as “89,000” and “1000”, and those values are stored as object dtypes instead of int64, you’ll end up with “89,0001000” instead of “90000”. Checking that your data is stored in the correct dtype, and casting it where it isn’t, is therefore an important step.
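
The example above can be reproduced with a short sketch. The Series here is hypothetical, and it assumes the thousands separator has to be stripped before the cast, since “89,000” is not a valid integer literal.

```python
import pandas as pd

# Hypothetical example data: numbers stored as strings
values = pd.Series(['89,000', '1000'], dtype='object')

# Adding the two strings concatenates them
joined = values.iloc[0] + values.iloc[1]
print(joined)  # 89,0001000

# Strip the thousands separator, then cast to int64 before summing
numbers = values.str.replace(',', '', regex=False).astype('int64')
total = numbers.sum()
print(total)  # 90000
```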

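The dtype also affects memory usage. Choosing a compact dtype, such as category for a column with only a few distinct values, or a smaller integer type where the range of values allows it, can substantially reduce the memory a dataframe uses.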
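
As a rough illustration of the memory point, the sketch below compares the same data stored with different dtypes. The data is invented for the example, and the exact byte counts for the text column will vary by Pandas and Python version, so only the integer figures are shown as expected output.

```python
import pandas as pd

# Hypothetical example data: 10,000 rows with only four distinct brands
brands = pd.Series(['Lotus', 'Alpine', 'Porsche', 'Lada'] * 2500, dtype='object')
small_ints = pd.Series(range(100), dtype='int64')

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = brands.memory_usage(index=False, deep=True)
category_bytes = brands.astype('category').memory_usage(index=False, deep=True)

# Values 0-99 fit in int8, which uses one byte per value instead of eight
int64_bytes = small_ints.memory_usage(index=False)
int8_bytes = small_ints.astype('int8').memory_usage(index=False)

print(object_bytes, category_bytes)
print(int64_bytes, int8_bytes)  # 800 100
```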

Pandas dtypes versus Python types

The Pandas dtype is essentially an internal construct that defines how the data is stored and how it can be used in various functions. Many Pandas operations check the dtype before they run, and may throw an error if they encounter the wrong one.

Confusingly, there are some subtle differences between the terminology Pandas uses for data types and the terminology used by Python and NumPy. Here’s a quick summary of the Pandas dtypes and the types of data that should be stored within them. In the tutorial below, I’ll show you how to identify the Pandas dtypes used by your Pandas columns and how you can cast them to other data types using astype().

Pandas dtype	Description
`object`	The Pandas `object` data type is used for storing strings or mixed numeric and non-numeric data. Its closest equivalent in Python is the `str` type, although an `object` column can hold any Python object.
`int64`	The Pandas `int64` data type is used for storing integers or whole numbers. It's the equivalent of the `int` type in Python.
`float64`	The Pandas `float64` data type is used for storing decimals or floating point numbers. It's the equivalent of the `float` type in Python.
`bool`	The Pandas `bool` data type is used for Boolean or True or False values. It's the equivalent of the `bool` type in Python.
`datetime64[ns]`	The Pandas `datetime64[ns]` data type is used for date and time values.
`timedelta64[ns]`	The Pandas `timedelta64[ns]` data type is used for storing time deltas, that is, the difference between two datetime values.
`category`	The Pandas `category` data type is used for columns that contain a limited set of repeated values, such as brand names or status labels. Pandas does not infer it when reading a CSV, so such columns load as `object` until you cast them.
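
To see the dtypes from the table side by side, here’s a small sketch that builds one column of each kind by hand. The column names and values are hypothetical. The exact resolution shown for the date and timedelta columns (for example [ns]) may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical one-column-per-dtype frame, built by hand for illustration
examples = pd.DataFrame({
    'mixed': pd.Series(['Engine', 1, 34.45], dtype='object'),
    'whole': [410, 330, 572],
    'decimal': [3.12, 4.24, 4.12],
    'flag': [True, False, True],
    'when': pd.to_datetime(['2022-08-21', '2022-08-22', '2022-08-23']),
    'elapsed': pd.to_timedelta(['963 days', '964 days', '965 days']),
    'brand': pd.Series(['Lotus', 'Alpine', 'Porsche'], dtype='category'),
})

# .dtypes lists the dtype Pandas has assigned to each column
print(examples.dtypes)
```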

Load the data

To get started, open a Jupyter notebook, import the Pandas package, and load up a dataset. I’ve created a dataset for you to use that includes a column containing data for each Pandas data type. When this dataset was exported to CSV, I’d set the correct data type for each column. However, Pandas will only infer some of these data types correctly.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/dtypes.csv')

df

	objects	int64s	float64s	bools	categories	datetime64s	current_date	timedeltas
0	Engine	410	3.12	True	Lotus	2022-08-21	2020-01-01	963 days
1	Gearbox	330	4.24	False	Alpine	2022-08-22	2020-01-01	964 days
2	1	572	4.12	True	Porsche	2022-08-23	2020-01-01	965 days
3	34.45	345	3.89	False	Lada	2022-08-23	2020-01-01	965 days
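
If you already know which dtypes you want, you can also declare some of them while loading the file. The sketch below re-types the four rows shown above as inline CSV text so it runs without a network connection. That layout is my assumption about the file, and the real CSV may differ in small details. It uses the dtype and parse_dates arguments of read_csv().

```python
import io
import pandas as pd

# The four sample rows shown above, re-typed as CSV text (an assumption
# about the file's layout; the real file may differ in small details)
csv_text = """objects,int64s,float64s,bools,categories,datetime64s,current_date,timedeltas
Engine,410,3.12,True,Lotus,2022-08-21,2020-01-01,963 days
Gearbox,330,4.24,False,Alpine,2022-08-22,2020-01-01,964 days
1,572,4.12,True,Porsche,2022-08-23,2020-01-01,965 days
34.45,345,3.89,False,Lada,2022-08-23,2020-01-01,965 days
"""

# Declare the dtypes Pandas cannot infer by itself while reading the file
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'categories': 'category'},
    parse_dates=['datetime64s', 'current_date'],
)

print(typed.dtypes)
```

The timedeltas column still needs converting after loading, since read_csv() has no equivalent option for time deltas that I’m aware of.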

Identifying Pandas dtypes using info()

To determine how well Pandas has inferred the data types in our sample dataframe we can use the info() function. Simply append .info() to the name of your dataframe and Pandas will return a table showing you the column names and their dtypes.

As you can see from our dataframe, some of the columns have the correct dtypes, but others need to have their dtypes modified. For example, the categories column is currently stored as an object dtype instead of category, the datetime64s and current_date columns are stored as object dtypes instead of datetime64, and the timedeltas column is stored as an object instead of a timedelta64.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   objects       4 non-null      object 
 1   int64s        4 non-null      int64  
 2   float64s      4 non-null      float64
 3   bools         4 non-null      bool   
 4   categories    4 non-null      object 
 5   datetime64s   4 non-null      object 
 6   current_date  4 non-null      object 
 7   timedeltas    4 non-null      object 
dtypes: bool(1), float64(1), int64(1), object(5)
memory usage: 356.0+ bytes
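
If you only need the dtypes and not the full summary, the dtypes attribute and the select_dtypes() method are useful companions to info(). Here’s a short sketch using a hypothetical three-column frame standing in for the tutorial dataset.

```python
import pandas as pd

# Hypothetical three-column frame standing in for the tutorial dataset
sample = pd.DataFrame({
    'objects': pd.Series(['Engine', 'Gearbox'], dtype='object'),
    'int64s': [410, 330],
    'float64s': [3.12, 4.24],
})

# .dtypes returns a Series mapping each column name to its dtype
print(sample.dtypes)

# select_dtypes() keeps only the columns that match the requested dtypes
numeric_columns = sample.select_dtypes(include='number').columns.tolist()
print(numeric_columns)  # ['int64s', 'float64s']
```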

Converting Pandas dtypes using astype()

Since Pandas has inferred the wrong data type for some of our columns, we need to convert them to the correct one. To convert the Pandas dtype of our column data we can use the astype() method.

The astype() function converts or “casts” data stored in one data type to be stored in another. For example, a number stored as an object can be cast to an int64 dtype. To use astype(), you simply append it to the Pandas column with the dtype as an argument and save the data back to the column.

df['categories'] = df['categories'].astype('category')
# Recent Pandas versions reject the unit-less 'datetime64', so include the unit
df['datetime64s'] = df['datetime64s'].astype('datetime64[ns]')
df['current_date'] = df['current_date'].astype('datetime64[ns]')

If you re-run .info(), you’ll be able to check that the casting process has worked correctly. You can now see that the categories column has been cast to the category data type, and the datetime64s and current_date columns have been cast to the datetime64 data type.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   objects       4 non-null      object        
 1   int64s        4 non-null      int64         
 2   float64s      4 non-null      float64       
 3   bools         4 non-null      bool          
 4   categories    4 non-null      category      
 5   datetime64s   4 non-null      datetime64[ns]
 6   current_date  4 non-null      datetime64[ns]
 7   timedeltas    4 non-null      object        
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(2)
memory usage: 520.0+ bytes
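
astype() can also be called on the whole dataframe with a dictionary of column names and dtypes, which saves repeating one line per column. The sketch below uses a hypothetical frame in which every column starts out as object, including numbers stored as text.

```python
import pandas as pd

# Hypothetical frame where every column starts out as object
raw = pd.DataFrame({
    'categories': ['Lotus', 'Alpine'],
    'int64s': ['410', '330'],
    'float64s': ['3.12', '4.24'],
}, dtype='object')

# Passing a dict of {column: dtype} casts several columns in one call
converted = raw.astype({
    'categories': 'category',
    'int64s': 'int64',
    'float64s': 'float64',
})

print(converted.dtypes)
print(converted['int64s'].sum())  # 740
```

astype() returns a new dataframe rather than modifying the original, so the result needs to be assigned to a variable.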

The timedeltas column is best converted with the dedicated Pandas to_timedelta() function rather than astype(), because it is designed to parse strings such as “963 days”. It accepts a whole column, so there is no need to call it row by row via apply(). Finally, we’ll run this function and re-run info(), which reveals we now have a Pandas dataframe with the correct data types assigned to our columns.

df['timedeltas'] = pd.to_timedelta(df['timedeltas'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype          
---  ------        --------------  -----          
 0   objects       4 non-null      object         
 1   int64s        4 non-null      int64          
 2   float64s      4 non-null      float64        
 3   bools         4 non-null      bool           
 4   categories    4 non-null      category       
 5   datetime64s   4 non-null      datetime64[ns] 
 6   current_date  4 non-null      datetime64[ns] 
 7   timedeltas    4 non-null      timedelta64[ns]
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 520.0+ bytes
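
With the right dtypes in place, date arithmetic works as you would expect. The sketch below uses a hypothetical two-row frame that mirrors the current_date and timedeltas columns, converts the time deltas, and adds them to the dates. The result column name is my own.

```python
import pandas as pd

# Hypothetical columns mirroring current_date and timedeltas in the dataset
frame = pd.DataFrame({
    'current_date': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'timedeltas': ['963 days', '964 days'],
})

# to_timedelta() accepts the whole column and returns a timedelta dtype
frame['timedeltas'] = pd.to_timedelta(frame['timedeltas'])

# Datetime + timedelta arithmetic now works column-wise
frame['result'] = frame['current_date'] + frame['timedeltas']

print(frame['timedeltas'].dt.days.tolist())  # [963, 964]
print(frame['result'].dt.strftime('%Y-%m-%d').tolist())  # ['2022-08-21', '2022-08-22']
```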

Matt Clarke, Thursday, August 25, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.