The Pandas concat() function is used to concatenate (or join together) two or more Pandas objects, such as dataframes or series. It can join two dataframes together vertically or horizontally, or add rows or columns to an existing dataframe.
Pandas concat() is an important function to learn, since the function traditionally used for these tasks, the Pandas append() function, was deprecated in version 1.4.0 and removed entirely in Pandas 2.0.
This is mildly annoying, as I prefer append(), but concat() has the benefit of being a bit quicker, as it adds all the additional data in a single operation rather than doing it iteratively. The difference is more noticeable on datasets larger than those I work on. So while it’s a bit fiddlier to use, it’s the recommended way to concatenate data, and you need to learn it.
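To make the "single operation" point concrete, here is a minimal sketch of the usual pattern: collect the pieces in a list, then call concat() once. The `batches` data is invented for illustration and stands in for whatever produces your rows (files, API pages and so on).

```python
import pandas as pd

# Hypothetical batches of rows, e.g. one per file or API page.
batches = [
    {"species": ["Pterophyllum altum"], "age": [3]},
    {"species": ["Vieja synspila"], "age": [2]},
    {"species": ["Altolamprologus calvus"], "age": [1]},
]

# Build each piece first and keep them in a plain Python list...
frames = [pd.DataFrame(batch) for batch in batches]

# ...then concatenate once, rather than growing a dataframe inside the loop.
# ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Growing a dataframe row by row copies the existing data on every step, so deferring the concatenation to the end tends to matter more as the number of pieces grows.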
The concat() function has one essential parameter, objs, which defines the Pandas objects you want to concatenate, whether those are Series or DataFrame objects. The remaining parameters are optional. Here’s a quick summary of what they do and why you might need to use them.
Parameter | Description |
---|---|
objs | The objs parameter is a list of Series or DataFrame objects to be concatenated together. |
axis | The axis parameter is the axis to concatenate along. The default is 0, which concatenates along the rows, while passing 1 will concatenate along the columns. |
join | The join parameter is the type of join to perform on the other axis. The default is outer, which keeps the union of the labels. The other option is inner, which keeps only the labels shared by all of the objects. |
ignore_index | The ignore_index parameter is a boolean value that determines whether to ignore the index of the individual DataFrames. The default is False, which preserves the index of each DataFrame. If set to True, the original index values are discarded and the result is labelled 0 to n - 1. |
keys | The keys parameter is a list of values used to build the outermost level of a hierarchical index. The list should contain one key per object being concatenated. |
levels | The levels parameter is a list of lists giving the specific values to use for each level of the hierarchical index. The default is None, in which case the levels are inferred from keys. |
names | The names parameter is a list of names for the levels of the resulting hierarchical index. |
verify_integrity | The verify_integrity parameter is a boolean value that determines whether to check the concatenated axis for duplicate labels. The default is False, which does not check for duplicates. If set to True, an exception will be raised if there are duplicates. |
sort | The sort parameter is a boolean value that determines whether to sort the non-concatenation axis (for example, the columns when stacking rows) when it is not already aligned. The default is False, which leaves the order as it is. |
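A few of these parameters are easier to grasp with a small example. The sketch below uses two made-up dataframes (`left` and `right`, with invented `tank_a` and `tank_b` labels) that share only some columns, to show how join changes which columns survive and how keys and names record where each row came from.

```python
import pandas as pd

# Illustrative frames that share only the species and age columns.
left = pd.DataFrame({"species": ["Pterophyllum altum"], "age": [3], "length": [12.5]})
right = pd.DataFrame({"species": ["Vieja synspila"], "age": [2], "weight": [11.0]})

# join="outer" (the default) keeps every column and fills the gaps with NaN.
outer = pd.concat([left, right], join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every input.
inner = pd.concat([left, right], join="inner", ignore_index=True)

# keys adds an outer index level recording which input each row came from;
# names labels the levels of the resulting hierarchical index.
labelled = pd.concat([left, right], keys=["tank_a", "tank_b"], names=["tank", "row"])

print(outer)
print(inner)
print(labelled.loc["tank_b"])  # only the rows that came from `right`
```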
To get started, open a Jupyter notebook and import the Pandas library using the as pd naming convention.
import pandas as pd
First, we’ll use concat() to concatenate or join together two dataframes vertically. We’ll create two dataframes that contain the same column names.
df1 = pd.DataFrame(
    [('Pterophyllum altum', 3, 12.5, 13.3),
     ('Pterophyllum scalare', 2, 10.0, 11.0),
     ('Pterophyllum leopoldi', 1, 8.0, 9.0)],
    columns=['species', 'age', 'length', 'weight']
)
df1
| | species | age | length | weight |
---|---|---|---|---|
0 | Pterophyllum altum | 3 | 12.5 | 13.3 |
1 | Pterophyllum scalare | 2 | 10.0 | 11.0 |
2 | Pterophyllum leopoldi | 1 | 8.0 | 9.0 |
df2 = pd.DataFrame(
    [('Vieja synspila', 2, 10.0, 11.0),
     ('Altolamprologus calvus', 1, 8.0, 9.0)],
    columns=['species', 'age', 'length', 'weight']
)
df2
| | species | age | length | weight |
---|---|---|---|---|
0 | Vieja synspila | 2 | 10.0 | 11.0 |
1 | Altolamprologus calvus | 1 | 8.0 | 9.0 |
To concatenate the dataframes vertically (i.e. one below the other), we can use the concat() function with no additional arguments. We’ll pass the objs parameter a list of the dataframes we want to concatenate. Since the column names are the same, we end up with a neat dataframe with df1 stacked on top of df2.
df3 = pd.concat([df1, df2])
df3
| | species | age | length | weight |
---|---|---|---|---|
0 | Pterophyllum altum | 3 | 12.5 | 13.3 |
1 | Pterophyllum scalare | 2 | 10.0 | 11.0 |
2 | Pterophyllum leopoldi | 1 | 8.0 | 9.0 |
0 | Vieja synspila | 2 | 10.0 | 11.0 |
1 | Altolamprologus calvus | 1 | 8.0 | 9.0 |
If you look closely at df3, you’ll spot that the index doesn’t look quite right: it repeats the labels 0 and 1, because each dataframe kept its original index when the two were concatenated. To fix this, you can use the ignore_index parameter to create a new index. Now we get back a neater dataframe in which the index runs from 0 to 4 with no repeats.
df4 = pd.concat([df1, df2], ignore_index=True)
df4
| | species | age | length | weight |
---|---|---|---|---|
0 | Pterophyllum altum | 3 | 12.5 | 13.3 |
1 | Pterophyllum scalare | 2 | 10.0 | 11.0 |
2 | Pterophyllum leopoldi | 1 | 8.0 | 9.0 |
3 | Vieja synspila | 2 | 10.0 | 11.0 |
4 | Altolamprologus calvus | 1 | 8.0 | 9.0 |
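Repeated index labels are more than a cosmetic issue, because label-based lookups then return several rows. The sketch below uses two small stand-in dataframes (`a` and `b`, invented for illustration) to show the effect, and how verify_integrity=True can be used to catch it early.

```python
import pandas as pd

a = pd.DataFrame({"species": ["Pterophyllum altum", "Pterophyllum scalare"]})
b = pd.DataFrame({"species": ["Vieja synspila"]})

# Without ignore_index, label 0 appears twice, so .loc[0] returns two rows.
stacked = pd.concat([a, b])
rows_for_label_0 = stacked.loc[0]
print(rows_for_label_0)

# verify_integrity=True refuses to build an index with duplicate labels.
try:
    pd.concat([a, b], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print(f"concat refused: {err}")

# ignore_index=True sidesteps the problem by numbering the rows afresh.
clean = pd.concat([a, b], ignore_index=True)
print(clean.index.is_unique)
```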
The concat() function can also be used to add a new row to a DataFrame. One way to do this is to create a new single-row dataframe with the same columns, and then concatenate it vertically using the ignore_index=True argument.
new_row = pd.DataFrame(
    [('Pterophyllum altum', 4, 14.0, 15.0)],
    columns=['species', 'age', 'length', 'weight']
)
df5 = pd.concat([df4, new_row], ignore_index=True)
df5
| | species | age | length | weight |
---|---|---|---|---|
0 | Pterophyllum altum | 3 | 12.5 | 13.3 |
1 | Pterophyllum scalare | 2 | 10.0 | 11.0 |
2 | Pterophyllum leopoldi | 1 | 8.0 | 9.0 |
3 | Vieja synspila | 2 | 10.0 | 11.0 |
4 | Altolamprologus calvus | 1 | 8.0 | 9.0 |
5 | Pterophyllum altum | 4 | 14.0 | 15.0 |
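If the new row arrives as a dictionary rather than a tuple, a similar approach should work. This is a sketch of one possible variation, not necessarily the alternative the article had in mind; the `fish` dataframe and `record` values are made up for illustration.

```python
import pandas as pd

fish = pd.DataFrame(
    [("Pterophyllum altum", 3, 12.5, 13.3)],
    columns=["species", "age", "length", "weight"],
)

# A new record held as a plain dict (illustrative values).
record = {"species": "Vieja synspila", "age": 2, "length": 10.0, "weight": 11.0}

# Wrapping the dict in a list yields a one-row dataframe whose columns
# come from the dict keys, so they line up with the existing columns.
fish = pd.concat([fish, pd.DataFrame([record])], ignore_index=True)
print(fish)
```

Because concat() matches columns by name, the order of the keys in the dictionary does not need to match the column order of the dataframe.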
The concat() function can also be used to add a new column to a dataframe. To do this, we need to create a new series using pd.Series() and then concatenate it to the dataframe using axis=1. When calling pd.Series() you need to provide one value per row, and define the name parameter to give the new column a name.
df6 = pd.concat([df5, pd.Series([5, 9, 23, 4, 9, 23], name='price')], axis=1)
df6
| | species | age | length | weight | price |
---|---|---|---|---|---|
0 | Pterophyllum altum | 3 | 12.5 | 13.3 | 5 |
1 | Pterophyllum scalare | 2 | 10.0 | 11.0 | 9 |
2 | Pterophyllum leopoldi | 1 | 8.0 | 9.0 | 23 |
3 | Vieja synspila | 2 | 10.0 | 11.0 | 4 |
4 | Altolamprologus calvus | 1 | 8.0 | 9.0 | 9 |
5 | Pterophyllum altum | 4 | 14.0 | 15.0 | 23 |
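One thing to watch when adding a column this way is that concat() lines values up by index label, not by position. The example above works because df5 has a default 0 to 5 index. The sketch below uses an invented `fish` dataframe with a non-default index to show what happens when the labels don't match, and one way to avoid it.

```python
import pandas as pd

# A dataframe whose index does not start at 0 (illustrative labels).
fish = pd.DataFrame(
    {"species": ["Pterophyllum altum", "Vieja synspila"]},
    index=[10, 11],
)

# A series built without an index gets labels 0 and 1, which do not
# match 10 and 11, so the outer join produces four rows padded with NaN.
misaligned = pd.concat([fish, pd.Series([5, 9], name="price")], axis=1)
print(misaligned)

# Reusing the dataframe's index makes the values line up row by row.
aligned = pd.concat(
    [fish, pd.Series([5, 9], name="price", index=fish.index)], axis=1
)
print(aligned)
```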
Matt Clarke, Sunday, November 27, 2022