Pandas is extremely versatile and includes a wide range of different methods you can use to add a new column or series to an existing dataframe. Whether you want to assign a constant value or scalar to every row, a specific value based on the individual row, or a new column based on a calculation or other operation, Pandas makes it easy.
In this tutorial I’ll show you a range of techniques you can use to quickly and easily add a column to a Pandas dataframe. We’ll cover manually assigning columns, creating new columns from lists, and using the insert()
, assign()
, and loc
techniques.
To get started, open a Jupyter notebook, import Pandas and create a Pandas dataframe. We’ll use this in the series of examples below and will add new columns to the dataframe using a variety of techniques.
import pandas as pd
df = pd.DataFrame({
'model': ['XF', 'XE', 'XJ'],
'top_speed': [120, 121, 145]
})
df
model | top_speed | |
---|---|---|
0 | XF | 120 |
1 | XE | 121 |
2 | XJ | 145 |
The most common way to add a new column to a Pandas dataframe is simply to declare the new column name and assign either a scalar (a single value that will be applied to every row) or a Python list of values corresponding to each row in the dataframe. In the example below, we’ll define a new column called manufacturer
to which we’ll assign a scalar or constant value, then we’ll define a column called mpg
to which we’ll assign a list of values.
df['manufacturer'] = 'Jaguar'
df['mpg'] = [45, 49, 41]
df
model | top_speed | manufacturer | mpg | |
---|---|---|---|---|
0 | XF | 120 | Jaguar | 45 |
1 | XE | 121 | Jaguar | 49 |
2 | XJ | 145 | Jaguar | 41 |
The Pandas insert()
function allows you to insert or add a new column to a Pandas dataframe based on either a scalar or constant value or a list of values. The benefit of using insert()
over just assigning the values is that you can use the loc
argument to define where the new column sits within the dataframe.
In the example below we’ll using insert()
to add a column called base_price
to which we’ll pass a list of values and we’ll place it in column 2. Since numbers count from 0 in Python, that puts it in the third column, rather than the second.
df.insert(loc=2,
column='base_price',
value=[35.1, 32.2, 53.7])
df
model | top_speed | base_price | manufacturer | mpg | |
---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 45 |
1 | XE | 121 | 32.2 | Jaguar | 49 |
2 | XJ | 145 | 53.7 | Jaguar | 41 |
The Pandas assign()
function works in a similar manner but instead returns a dataframe containing the newly defined column. It can also be assigned a scalar value that you want to add to every row, a list of values to that are assigned to each row in order, and also a lambda function based on a calculated value.
df = df.assign(top_price=[47730, 42345, 122345])
df
model | top_speed | base_price | manufacturer | mpg | top_price | |
---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 45 | 47730 |
1 | XE | 121 | 32.2 | Jaguar | 49 | 42345 |
2 | XJ | 145 | 53.7 | Jaguar | 41 | 122345 |
The .loc[]
method also lets you assign a new column or series to a Pandas dataframe. Personally, I find it a bit clunkier than the other methods, so tend to use the others more often. As with the other methods, you can define a scalar value, a list of values, or perform a calculation based on other columns within the dataframe.
df.loc[:,'0_to_60'] = [5.5, 6.2, 4.1]
df
model | top_speed | base_price | manufacturer | mpg | top_price | 0_to_60 | |
---|---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 45 | 47730 | 5.5 |
1 | XE | 121 | 32.2 | Jaguar | 49 | 42345 | 6.2 |
2 | XJ | 145 | 53.7 | Jaguar | 41 | 122345 | 4.1 |
You can also use most of these approaches to add a new column to a Pandas dataframe based on a calculation or other function performed on one or more other columns in the dataframe. Ignore the fact that the below calculation is totally nonsense, but here’s how you can do this using the various approaches shown above.
df['calculation'] = df['top_speed'] / df['base_price']
df
model | top_speed | base_price | manufacturer | mpg | top_price | 0_to_60 | calculation | |
---|---|---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 45 | 47730 | 5.5 | 3.418803 |
1 | XE | 121 | 32.2 | Jaguar | 49 | 42345 | 6.2 | 3.757764 |
2 | XJ | 145 | 53.7 | Jaguar | 41 | 122345 | 4.1 | 2.700186 |
df.insert(loc=4,
column='calculation2',
value=df['top_speed'] / df['base_price'])
df
model | top_speed | base_price | manufacturer | calculation2 | mpg | top_price | 0_to_60 | calculation | |
---|---|---|---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 3.418803 | 45 | 47730 | 5.5 | 3.418803 |
1 | XE | 121 | 32.2 | Jaguar | 3.757764 | 49 | 42345 | 6.2 | 3.757764 |
2 | XJ | 145 | 53.7 | Jaguar | 2.700186 | 41 | 122345 | 4.1 | 2.700186 |
df = df.assign(calculation3=df['top_speed'] / df['base_price'])
df
model | top_speed | base_price | manufacturer | calculation2 | mpg | top_price | 0_to_60 | calculation | calculation3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 3.418803 | 45 | 47730 | 5.5 | 3.418803 | 3.418803 |
1 | XE | 121 | 32.2 | Jaguar | 3.757764 | 49 | 42345 | 6.2 | 3.757764 | 3.757764 |
2 | XJ | 145 | 53.7 | Jaguar | 2.700186 | 41 | 122345 | 4.1 | 2.700186 | 2.700186 |
df.loc[:,'calculation4'] = df['top_speed'] / df['base_price']
df
model | top_speed | base_price | manufacturer | calculation2 | mpg | top_price | 0_to_60 | calculation | calculation3 | calculation4 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | XF | 120 | 35.1 | Jaguar | 3.418803 | 45 | 47730 | 5.5 | 3.418803 | 3.418803 | 3.418803 |
1 | XE | 121 | 32.2 | Jaguar | 3.757764 | 49 | 42345 | 6.2 | 3.757764 | 3.757764 | 3.757764 |
2 | XJ | 145 | 53.7 | Jaguar | 2.700186 | 41 | 122345 | 4.1 | 2.700186 | 2.700186 | 2.700186 |
Matt Clarke, Saturday, November 05, 2022