Many Pandas functions require data to be stored in the correct data type, or dtype
as it’s known. For example, “£32,232.92” will be recognised as an object
data type because it’s not completely numeric due to the presence of the comma separator and the currency symbol, so Pandas can’t perform calculations upon the values unless you strip out the non-numeric characters first and set the dtype
correctly.
In this simple tutorial, I’ll show you how you can overcome the problem of dirty data and convert non-numeric data to numeric data, so it can be used in calculations using the Pandas to_numeric()
function.
To get started, open a Jupyter notebook and import the Pandas library using the import pandas as pd
naming convention, so you can refer to Pandas using the pd
shorthand. You might also want to add import warnings
and set warnings.filterwarnings('ignore')
to hide the warning that Pandas will return about forthcoming changes to the way in which some functions work.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
Next, create a Pandas dataframe that contains some values that are stored as strings but should really be stored as numeric data types. We can make a numeric value into a string by wrapping it in quotes, or including other characters to the numbers, such as comma thousands separators or currency symbols, such as pound signs.
df = pd.DataFrame({'order_id': ['1', '2', '3', '4', '5'],
'revenue': ['£27,232.64', '£12,345.67', '£23,456.78', '£34,567.89', '£45,678.90']})
df.head()
order_id | revenue | |
---|---|---|
0 | 1 | £27,232.64 |
1 | 2 | £12,345.67 |
2 | 3 | £23,456.78 |
3 | 4 | £34,567.89 |
4 | 5 | £45,678.90 |
To check the dtypes
Pandas has inferred from your dataframe you can append the info()
method to the dataframe object to return some data. The object
value in the Dtype
column means that our values are currently stored as objects or strings.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 5 non-null object
1 revenue 5 non-null object
dtypes: object(2)
memory usage: 208.0+ bytes
There are actually several ways you can change a string to a numeric value in Pandas. Changing a data type is known as “casting”, so here we’re going to use Pandas to cast the dtype
of the order_id
column to a numeric dtype
. Unlike the astype()
method, you don’t need to explicitly define the dtype
you want to use, as Pandas will infer the correct dtype
from the data stored in the column.
We’ll use the assign()
function to run the order_id
through to_numeric()
and we’ll save the transformed dataframe object. After running the code, if you run df.info()
again you’ll see that Pandas has inferred that the order_id
should be an int64
dtype, so it’s cast the value accordingly.
df = df.assign(order_id = pd.to_numeric(df['order_id']))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 5 non-null int64
1 revenue 5 non-null object
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes
Next, we’ll use to_numeric()
function to attempt converting the string stored in the revenue column to a numeric value. Unlike the order_id
column, which was just a number encapsulated in quotes, the revenue
column contains a value with a thousands comma separator and a pound sign currency symbol. If we attempt to use assign()
to run to_numeric()
on this column we’ll receive an error stating “ValueError: Unable to parse string “£27,232.64” at position 0”.
df = df.assign(
revenue = pd.to_numeric(df['revenue'])
)
The error we received above is because, despite the name, to_numeric()
doesn’t actually strip out non-numeric characters before attempting to convert them. Therefore, we need to handle this before we pass the value to to_numeric()
.
One easy way to strip non-numeric characters from a string is to use the replace()
function with a Python regular expression to detect non-numeric values. We’ll append this and replace any non-numeric values with nothing.
df = df.assign(
revenue = pd.to_numeric(df['revenue'].str.replace('[^0-9.]', ''))
)
Now when we re-run the assign()
function with our to_numeric()
call, we should see that the non-numeric values have been removed and that Pandas has inferred the dtype
as float64
, thus successfully converting our non-numeric data to a numeric representation we can use in calculations.
df.head()
order_id | revenue | |
---|---|---|
0 | 1 | 27232.64 |
1 | 2 | 12345.67 |
2 | 3 | 23456.78 |
3 | 4 | 34567.89 |
4 | 5 | 45678.90 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 5 non-null int64
1 revenue 5 non-null float64
dtypes: float64(1), int64(1)
memory usage: 208.0 bytes
Matt Clarke, Sunday, January 01, 2023