The Pandas assign() function is used to create new columns in a dataframe, usually based on calculations. It takes the name of each new column as a keyword argument along with the value to assign, which can come from a calculation on existing dataframe columns or from a lambda function.
The assign() function returns a new dataframe with the new column added and does not modify the original, so you need to save the output back to the original dataframe variable to retain the new column. In this simple tutorial, I’ll show how you can use the Pandas assign() method to avoid the Pandas SettingWithCopyWarning warning and use it in method chaining code.
To get started, open a Jupyter notebook and import the Pandas library using the import pandas as pd naming convention, then create a Pandas dataframe. This one contains the scientific names of some fish species and their lengths in centimetres. We’ll perform some calculations on these lengths in a second.
import pandas as pd
df = pd.DataFrame(
    [('Pterophyllum altum', 12.56),
     ('Pterophyllum scalare', 11.82),
     ('Pterophyllum leopoldi', 14.23)],
    columns=['species', 'length_cm']
)
df
| | species | length_cm |
|---|---|---|
| 0 | Pterophyllum altum | 12.56 |
| 1 | Pterophyllum scalare | 11.82 |
| 2 | Pterophyllum leopoldi | 14.23 |
The most common way to use the Pandas assign() method is to call it on a Pandas dataframe object. The keyword argument name, length_mm, will become the name of the new Pandas column, while the = df['length_cm'] * 10 part will take each length_cm value, multiply it by 10, and assign the result to length_mm. The assign() method returns a dataframe containing the new column, so we’ll need to reassign this back to df to save it.
df = df.assign(length_mm = df['length_cm'] * 10)
df
| | species | length_cm | length_mm |
|---|---|---|---|
| 0 | Pterophyllum altum | 12.56 | 125.6 |
| 1 | Pterophyllum scalare | 11.82 | 118.2 |
| 2 | Pterophyllum leopoldi | 14.23 | 142.3 |
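To see why the reassignment matters, here is a minimal sketch that keeps the result in a separate variable instead of overwriting the original. The names fish and with_mm are illustrative only and are not part of the tutorial's own code.

```python
import pandas as pd

# Same illustrative data as the tutorial, under a hypothetical name
fish = pd.DataFrame(
    [('Pterophyllum altum', 12.56),
     ('Pterophyllum scalare', 11.82),
     ('Pterophyllum leopoldi', 14.23)],
    columns=['species', 'length_cm']
)

# assign() returns a new dataframe; `fish` itself is left untouched
with_mm = fish.assign(length_mm=fish['length_cm'] * 10)

print(list(fish.columns))     # ['species', 'length_cm']
print(list(with_mm.columns))  # ['species', 'length_cm', 'length_mm']
```

If you call assign() without capturing the return value, the new column is simply discarded.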
You can use lambda functions to achieve the same thing as we saw above. Importantly, if you’re using the modern method chaining approach, you will need to use lambda functions. The lambda receives the dataframe as it stands at that step of the chain, whereas a direct reference such as df['length_cm'] reads from the dataframe as it was before the chain started, so it misses any columns or rows changed by earlier steps.
df = df.assign(length_m = lambda x: x['length_cm'] / 100)
df
| | species | length_cm | length_mm | length_m |
|---|---|---|---|---|
| 0 | Pterophyllum altum | 12.56 | 125.6 | 0.1256 |
| 1 | Pterophyllum scalare | 11.82 | 118.2 | 0.1182 |
| 2 | Pterophyllum leopoldi | 14.23 | 142.3 | 0.1423 |
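The difference between a lambda and a direct reference shows up as soon as one step in a chain depends on a column created by a previous step. The sketch below is one way to illustrate it. The names fish, chained and direct_reference_failed are made up for this example.

```python
import pandas as pd

fish = pd.DataFrame({
    'species': ['Pterophyllum altum', 'Pterophyllum scalare', 'Pterophyllum leopoldi'],
    'length_cm': [12.56, 11.82, 14.23],
})

# Works: each lambda receives the dataframe produced by the previous step,
# so the second assign() can see the length_mm column
chained = (
    fish
    .assign(length_mm=lambda x: x['length_cm'] * 10)
    .assign(length_m=lambda x: x['length_mm'] / 1000)
)

# Fails: `fish` is the dataframe from before the chain started,
# and it never had a length_mm column
try:
    (
        fish
        .assign(length_mm=fish['length_cm'] * 10)
        .assign(length_m=fish['length_mm'] / 1000)
    )
    direct_reference_failed = False
except KeyError:
    direct_reference_failed = True

print(direct_reference_failed)  # True
```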
The other really neat thing is that you can also create several columns in a single assign() call by passing multiple keyword arguments, each with its own lambda function. Here we’ll create new columns to hold the length_in and length_ft values for each species, then save them back to the original dataframe.
df = df.assign(length_in = lambda x: x['length_cm'] * 0.393701,
               length_ft = lambda x: x['length_cm'] * 0.0328084)
df
| | species | length_cm | length_mm | length_m | length_in | length_ft |
|---|---|---|---|---|---|---|
| 0 | Pterophyllum altum | 12.56 | 125.6 | 0.1256 | 4.944885 | 0.412074 |
| 1 | Pterophyllum scalare | 11.82 | 118.2 | 0.1182 | 4.653546 | 0.387795 |
| 2 | Pterophyllum leopoldi | 14.23 | 142.3 | 0.1423 | 5.602365 | 0.466864 |
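When you pass several keyword arguments, Pandas evaluates them in order, so a later lambda can refer to a column created earlier in the same assign() call. This assumes a reasonably recent setup (Python 3.6 or later and Pandas 0.23 or later). The sketch below derives feet from the inches column rather than from length_cm. The names fish and converted are illustrative.

```python
import pandas as pd

fish = pd.DataFrame({
    'species': ['Pterophyllum altum', 'Pterophyllum scalare', 'Pterophyllum leopoldi'],
    'length_cm': [12.56, 11.82, 14.23],
})

converted = fish.assign(
    # Created first: centimetres to inches
    length_in=lambda x: x['length_cm'] * 0.393701,
    # Created second: reuses length_in from the line above (12 inches per foot)
    length_ft=lambda x: x['length_in'] / 12,
)

print(list(converted.columns))
# ['species', 'length_cm', 'length_in', 'length_ft']
```

This only works with lambdas (or other callables), because the new column does not exist yet when plain expressions are evaluated.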
Finally, the coolest way to use the Pandas assign() method is with the method chaining technique. This modern Pandas programming style can aid code readability, so it’s become more popular, despite remaining a somewhat divisive construct that some data scientists really don’t like.
To use the assign() method in a chain, you wrap the whole expression in parentheses, which allows you to format what is essentially a one-liner into a readable form using whitespace. In the example below, we’ll create a new dataframe called df_chain, start from the original df dataframe we created above, and then pass a series of lambda functions to assign() to perform our calculations and create new columns in the dataframe.
df_chain = (
    df
    .assign(length_mm = lambda x: x['length_cm'] * 10,
            length_m = lambda x: x['length_cm'] / 100,
            length_in = lambda x: x['length_cm'] * 0.393701,
            length_ft = lambda x: x['length_cm'] * 0.0328084
    )
)
df_chain
| | species | length_cm | length_mm | length_m | length_in | length_ft |
|---|---|---|---|---|---|---|
| 0 | Pterophyllum altum | 12.56 | 125.6 | 0.1256 | 4.944885 | 0.412074 |
| 1 | Pterophyllum scalare | 11.82 | 118.2 | 0.1182 | 4.653546 | 0.387795 |
| 2 | Pterophyllum leopoldi | 14.23 | 142.3 | 0.1423 | 5.602365 | 0.466864 |
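As for SettingWithCopyWarning, it typically appears when you filter a dataframe and then write a new column into the filtered result with subset['col'] = ..., though whether you see it depends on your Pandas version. Here is a minimal sketch of the alternative, filtering and assigning in one chain so nothing is ever written into a slice. The names fish and large, and the 12 cm threshold, are assumptions made for illustration.

```python
import pandas as pd

fish = pd.DataFrame({
    'species': ['Pterophyllum altum', 'Pterophyllum scalare', 'Pterophyllum leopoldi'],
    'length_cm': [12.56, 11.82, 14.23],
})

large = (
    fish
    # Keep only species longer than 12 cm (hypothetical threshold)
    .loc[lambda x: x['length_cm'] > 12]
    # assign() builds a new dataframe instead of writing into the filtered slice
    .assign(length_mm=lambda x: x['length_cm'] * 10)
)

print(list(large['species']))
# ['Pterophyllum altum', 'Pterophyllum leopoldi']
```

The original fish dataframe keeps its two columns, and large is an independent dataframe you can keep modifying safely.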
Matt Clarke, Saturday, January 07, 2023