How to use Pandas to_numeric() to convert strings to numbers

Picture by Black Ice, Pexels.

7 minutes to read

Data Science Pandas

Many Pandas functions require data to be stored in the correct data type, or dtype as it’s known. For example, “£32,232.92” will be recognised as an object data type because it’s not completely numeric due to the presence of the comma separator and the currency symbol, so Pandas can’t perform calculations upon the values unless you strip out the non-numeric characters first and set the dtype correctly.

In this simple tutorial, I’ll show you how you can overcome the problem of dirty data and convert non-numeric data to numeric data, so it can be used in calculations using the Pandas to_numeric() function.

Import the packages

To get started, open a Jupyter notebook and import the Pandas library using the import pandas as pd naming convention, so you can refer to Pandas using the pd shorthand. You might also want to add import warnings and set warnings.filterwarnings('ignore') to hide the warning that Pandas will return about forthcoming changes to the way in which some functions work.

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Create a dataframe

Next, create a Pandas dataframe that contains some values that are stored as strings but should really be stored as numeric data types. We can make a numeric value into a string by wrapping it in quotes, or including other characters to the numbers, such as comma thousands separators or currency symbols, such as pound signs.

df = pd.DataFrame({'order_id': ['1', '2', '3', '4', '5'], 
                   'revenue': ['£27,232.64', '£12,345.67', '£23,456.78', '£34,567.89', '£45,678.90']})
df.head()

	order_id	revenue
0	1	£27,232.64
1	2	£12,345.67
2	3	£23,456.78
3	4	£34,567.89
4	5	£45,678.90

Use info() to view current dtypes

To check the dtypes Pandas has inferred from your dataframe you can append the info() method to the dataframe object to return some data. The object value in the Dtype column means that our values are currently stored as objects or strings.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  5 non-null      object
 1   revenue   5 non-null      object
dtypes: object(2)
memory usage: 208.0+ bytes

Use to_numeric() to convert a string to a numeric value

There are actually several ways you can change a string to a numeric value in Pandas. Changing a data type is known as “casting”, so here we’re going to use Pandas to cast the dtype of the order_id column to a numeric dtype. Unlike the astype() method, you don’t need to explicitly define the dtype you want to use, as Pandas will infer the correct dtype from the data stored in the column.

We’ll use the assign() function to run the order_id through to_numeric() and we’ll save the transformed dataframe object. After running the code, if you run df.info() again you’ll see that Pandas has inferred that the order_id should be an int64 dtype, so it’s cast the value accordingly.

df = df.assign(order_id = pd.to_numeric(df['order_id']))

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  5 non-null      int64 
 1   revenue   5 non-null      object
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

Strip non-numeric characters from the revenue column before converting to numeric

Next, we’ll use to_numeric() function to attempt converting the string stored in the revenue column to a numeric value. Unlike the order_id column, which was just a number encapsulated in quotes, the revenue column contains a value with a thousands comma separator and a pound sign currency symbol. If we attempt to use assign() to run to_numeric() on this column we’ll receive an error stating “ValueError: Unable to parse string “£27,232.64” at position 0”.

df = df.assign(
    revenue = pd.to_numeric(df['revenue'])
)

The error we received above is because, despite the name, to_numeric() doesn’t actually strip out non-numeric characters before attempting to convert them. Therefore, we need to handle this before we pass the value to to_numeric().

One easy way to strip non-numeric characters from a string is to use the replace() function with a Python regular expression to detect non-numeric values. We’ll append this and replace any non-numeric values with nothing.

df = df.assign(
    revenue = pd.to_numeric(df['revenue'].str.replace('[^0-9.]', ''))
)

Now when we re-run the assign() function with our to_numeric() call, we should see that the non-numeric values have been removed and that Pandas has inferred the dtype as float64, thus successfully converting our non-numeric data to a numeric representation we can use in calculations.

df.head()

	order_id	revenue
0	1	27232.64
1	2	12345.67
2	3	23456.78
3	4	34567.89
4	5	45678.90

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   order_id  5 non-null      int64  
 1   revenue   5 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 208.0 bytes

Matt Clarke, Sunday, January 01, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.