How to identify and remove duplicate values in Pandas

Learn how to identify duplicate Pandas column values and rows using duplicated() and remove or de-dupe them using the drop_duplicates() method.

Picture by Vojtech Okenka, Pexels.

Duplicate values are a common occurrence in data science, and they come in various forms. Not only will you need to be able to identify duplicate values, but you will also need to be able to remove them from your data using a process known as de-duplication or de-duping.

The different types of data duplication

There are three main types of data duplication:

  • Exact duplicates are rows that contain the same values in every column.
  • Partial duplicates are rows that contain the same values in several columns, but not all of them.
  • Duplicate keys are rows that contain the same value in a single key column, such as an ID, while the other columns differ.

In this post, you will learn how to identify duplicate values using the duplicated() method and how to remove them using the drop_duplicates() method. We’ll handle everything from rows that are completely duplicated (exact duplicates), to rows that include duplicate values in just one column (duplicate keys), and those that include duplicate values in multiple columns (partial duplicates).

Import the packages

To get started, you will need to open a new Jupyter Notebook and import the pandas package, aliasing it as pd to make it easier to reference later on.

import pandas as pd

Create a dataframe containing some duplicate values

Next, you can either import your data into Pandas or you can create a new dataframe containing some duplicate values. The example here creates various types of duplicate values.
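If your own data lives in a file, you can load it with one of Pandas' readers instead. Here's a minimal sketch, assuming a hypothetical CSV file called fish.csv with the same columns.

# Load the data from a CSV file instead of defining it by hand (hypothetical filename)
df = pd.read_csv('fish.csv')

Otherwise, you can create the example dataframe used throughout this post as follows.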

# Create a dataframe
data = [{'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3}, 
        {'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
        {'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
        {'species': 'Esox lucius', 'length': 120, 'weight': 7.7, 'age': 2},
        {'species': 'Esox lucius', 'length': 110, 'weight': 7.9, 'age': 2},
        {'species': 'Cyprinus carpio', 'length': 56, 'weight': 8.3, 'age': 13},
        {'species': 'Cyprinus carpio', 'length': 36, 'weight': 7.9, 'age': 23},
        {'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
        {'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
        {'species': 'Salmo trutta', 'length': 40, 'weight': 7.5, 'age': 5},
        {'species': 'Salmo trutta', 'length': 38, 'weight': 7.4, 'age': 4},
        {'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
        {'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
        {'species': 'Salmo salar', 'length': 44, 'weight': 7.7, 'age': 5}]

df = pd.DataFrame(data)
df
species length weight age
0 Esox lucius 120 8.1 3
1 Esox lucius 120 8.1 3
2 Esox lucius 120 8.1 3
3 Esox lucius 120 7.7 2
4 Esox lucius 110 7.9 2
5 Cyprinus carpio 56 8.3 13
6 Cyprinus carpio 36 7.9 23
7 Cyprinus carpio 46 8.1 13
8 Cyprinus carpio 46 8.1 13
9 Salmo trutta 40 7.5 5
10 Salmo trutta 38 7.4 4
11 Oncorhynchus mykiss 42 7.6 5
12 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5

Use duplicated() to return a boolean series indicating whether a row is a duplicate

First, we’ll look at the duplicated() method. This returns a boolean series indicating whether each row is a duplicate of a row that appears earlier in the dataframe. Note that the result is a standalone series, with the dataframe’s index preserved as its index.

By default, duplicated() considers a row to be a duplicate only when all of its values match those of an earlier row. The first occurrence of each set of duplicates is treated as unique, so it is marked False; only the subsequent occurrences are flagged as True.

df.duplicated()
0     False
1      True
2      True
3     False
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
dtype: bool
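
Because the result is a boolean series, you can use it directly as a filter to inspect just the rows that have been flagged. For example:

# Show only the rows flagged as duplicates of an earlier row
df[df.duplicated()]

For this dataframe, that returns rows 1, 2, 8 and 12.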

Use duplicated() to add a new column stating whether the row is a duplicate

By default, the duplicated() method returns a boolean series, which isn’t always the most convenient thing to work with on its own. Instead, you may wish to add a new column to the dataframe that states whether each row is a duplicate.

df2 = df.copy()
df2['duplicated'] = df2.duplicated()
df2
species length weight age duplicated
0 Esox lucius 120 8.1 3 False
1 Esox lucius 120 8.1 3 True
2 Esox lucius 120 8.1 3 True
3 Esox lucius 120 7.7 2 False
4 Esox lucius 110 7.9 2 False
5 Cyprinus carpio 56 8.3 13 False
6 Cyprinus carpio 36 7.9 23 False
7 Cyprinus carpio 46 8.1 13 False
8 Cyprinus carpio 46 8.1 13 True
9 Salmo trutta 40 7.5 5 False
10 Salmo trutta 38 7.4 4 False
11 Oncorhynchus mykiss 42 7.6 5 False
12 Oncorhynchus mykiss 42 7.6 5 True
13 Salmo salar 44 7.7 5 False

Use keep='first' to keep the first occurrence of a duplicate

By default, duplicated() treats the first occurrence of each duplicate as unique and only flags the later occurrences as True. This is the same as setting the keep parameter to 'first', so the code here does exactly the same thing, but being explicit makes the intent clearer. Note that the flag is computed from the original df rather than df2: if we called duplicated() on df2, the duplicated column we just added would be included in the comparison and would skew the results.

# Compute the flag from the original columns so the added boolean column isn't compared
df2['duplicated_first'] = df.duplicated(keep='first')
df2
species length weight age duplicated duplicated_first
0 Esox lucius 120 8.1 3 False False
1 Esox lucius 120 8.1 3 True True
2 Esox lucius 120 8.1 3 True True
3 Esox lucius 120 7.7 2 False False
4 Esox lucius 110 7.9 2 False False
5 Cyprinus carpio 56 8.3 13 False False
6 Cyprinus carpio 36 7.9 23 False False
7 Cyprinus carpio 46 8.1 13 False False
8 Cyprinus carpio 46 8.1 13 True True
9 Salmo trutta 40 7.5 5 False False
10 Salmo trutta 38 7.4 4 False False
11 Oncorhynchus mykiss 42 7.6 5 False False
12 Oncorhynchus mykiss 42 7.6 5 True True
13 Salmo salar 44 7.7 5 False False

Use keep='last' to keep the last occurrence of a duplicate

If instead you wish to keep the last occurrence of a duplicate, you can set the keep parameter to 'last'. This marks the last occurrence of each duplicate as False and all of the earlier occurrences as True. Again, the flag is computed from the original df so the boolean columns we’ve added aren’t part of the comparison.

# Flag duplicates, treating the last occurrence of each as the one to keep
df2['duplicated_last'] = df.duplicated(keep='last')
df2
species length weight age duplicated duplicated_first duplicated_last
0 Esox lucius 120 8.1 3 False False True
1 Esox lucius 120 8.1 3 True True True
2 Esox lucius 120 8.1 3 True True False
3 Esox lucius 120 7.7 2 False False False
4 Esox lucius 110 7.9 2 False False False
5 Cyprinus carpio 56 8.3 13 False False False
6 Cyprinus carpio 36 7.9 23 False False False
7 Cyprinus carpio 46 8.1 13 False False True
8 Cyprinus carpio 46 8.1 13 True True False
9 Salmo trutta 40 7.5 5 False False False
10 Salmo trutta 38 7.4 4 False False False
11 Oncorhynchus mykiss 42 7.6 5 False False True
12 Oncorhynchus mykiss 42 7.6 5 True True False
13 Salmo salar 44 7.7 5 False False False
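
As well as 'first' and 'last', the keep parameter accepts False, which marks every occurrence of a duplicated row as True rather than sparing one of them. This is useful when you want to inspect all copies of a duplicate side by side.

# Flag every occurrence of an exactly duplicated row, not just the repeats
df.duplicated(keep=False)

For this data, that returns True for rows 0, 1, 2, 7, 8, 11 and 12, and False for everything else.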

Find duplicates based on a single column with subset

In the default example, duplicated() is looking at the entire row to determine if it is a duplicate. If you want to find duplicates based on a single column, you can use the subset parameter. For example, if you want to find duplicates based on the species column, you can do the following. You can, of course, also combine this with the keep parameter to determine which duplicates to keep.

df.duplicated(subset=['species'])
0     False
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9     False
10     True
11    False
12     True
13    False
dtype: bool
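
If you just want to know how many duplicates there are, you can sum the boolean series, since each True counts as 1.

# Count the rows flagged as duplicates on the species column
df.duplicated(subset=['species']).sum()

For this data, that returns 9.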

Find duplicates based on multiple columns with subset

You may also want to find duplicates based on multiple columns. For example, if you want to find duplicates based on the species and length columns, you can do the following. You can also combine this with the keep parameter to determine which duplicates to keep, as shown after the output below.

df.duplicated(subset=['species', 'length'])
0     False
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
dtype: bool
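
As mentioned above, subset and keep can be combined, so for example you could flag duplicates on species and length while sparing the last occurrence of each.

# Flag duplicates on species and length, keeping the last occurrence unflagged
df.duplicated(subset=['species', 'length'], keep='last')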

De-duplicate the data using drop_duplicates()

If you want to remove the duplicate values from your data, you can use the drop_duplicates() method. This is known as de-duplicating or de-duping your data. The drop_duplicates() method works just like the duplicated() method, except that it returns a new DataFrame with the duplicate rows removed. You can use the keep parameter to determine which duplicates to keep. The default is to keep the first duplicate and remove the rest. You can also use the subset parameter to determine which columns to use to determine duplicates.

To refresh our memory on the original dataframe, let’s take a look at it again.

df
species length weight age
0 Esox lucius 120 8.1 3
1 Esox lucius 120 8.1 3
2 Esox lucius 120 8.1 3
3 Esox lucius 120 7.7 2
4 Esox lucius 110 7.9 2
5 Cyprinus carpio 56 8.3 13
6 Cyprinus carpio 36 7.9 23
7 Cyprinus carpio 46 8.1 13
8 Cyprinus carpio 46 8.1 13
9 Salmo trutta 40 7.5 5
10 Salmo trutta 38 7.4 4
11 Oncorhynchus mykiss 42 7.6 5
12 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5

Drop duplicate rows with matching values in all columns

Let’s say we want to remove all duplicate rows from our data. We can do this by using the drop_duplicates() method without any parameters. This will remove all duplicate rows from our data where the values are the same in all columns. By default, it will keep the first occurrence and remove the rest.

df1 = df.drop_duplicates()
df1
species length weight age
0 Esox lucius 120 8.1 3
3 Esox lucius 120 7.7 2
4 Esox lucius 110 7.9 2
5 Cyprinus carpio 56 8.3 13
6 Cyprinus carpio 36 7.9 23
7 Cyprinus carpio 46 8.1 13
9 Salmo trutta 40 7.5 5
10 Salmo trutta 38 7.4 4
11 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5
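
A quick way to check how many rows were removed is to compare the lengths of the original and de-duplicated dataframes.

# Number of exact duplicate rows that were dropped
len(df) - len(df1)

Here, that returns 4, matching the four rows flagged by duplicated() earlier.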

Drop duplicate rows with matching values in a single column

Next, let’s say we want to remove all duplicate rows from our data where the values are the same in a single column. We can do this by using the drop_duplicates() method and specifying the subset parameter. This will remove all duplicate rows from our data where the values are the same in the species column. By default, it will keep the first occurrence and remove the rest.

df2 = df.drop_duplicates(subset=['species'])
df2
species length weight age
0 Esox lucius 120 8.1 3
5 Cyprinus carpio 56 8.3 13
9 Salmo trutta 40 7.5 5
11 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5

Drop duplicate rows with matching values in a single column and keep the last row

If you want to keep the last occurrence instead of the first, you can use the keep parameter. Setting keep='last' still removes all duplicate rows where the values are the same in the species column, but keeps the last occurrence of each rather than the first.

df2_last = df.drop_duplicates(subset=['species'], keep='last')
df2_last
species length weight age
4 Esox lucius 110 7.9 2
8 Cyprinus carpio 46 8.1 13
10 Salmo trutta 38 7.4 4
12 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5
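
Bear in mind that keep='first' and keep='last' depend purely on row order. If you want to keep a particular row from each group, for example the longest fish recorded for each species, a common pattern is to sort the dataframe first and then de-dupe. This is a general technique rather than anything specific to this dataset.

# Keep the longest recorded fish for each species
df_longest = df.sort_values('length', ascending=False).drop_duplicates(subset=['species'])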

Drop duplicates based on multiple columns with subset

You’re not limited to de-duping data based on a single column or the entire row; you can also specify multiple columns by passing a list of column names to the subset parameter. This will remove all duplicate rows from our data where the values are the same in the species and length columns. By default, it will keep the first occurrence and remove the rest.

df3 = df.drop_duplicates(subset=['species', 'length'])
df3
species length weight age
0 Esox lucius 120 8.1 3
4 Esox lucius 110 7.9 2
5 Cyprinus carpio 56 8.3 13
6 Cyprinus carpio 36 7.9 23
7 Cyprinus carpio 46 8.1 13
9 Salmo trutta 40 7.5 5
10 Salmo trutta 38 7.4 4
11 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5

Drop duplicates using inplace

Finally, there’s the inplace parameter. By default it’s set to False, which means drop_duplicates() returns a new DataFrame and leaves the original untouched. If you set it to True, the method modifies the original DataFrame in place and returns None, so there’s nothing to assign; this is handy if you don’t need the original data anymore.

df.drop_duplicates(inplace=True)
df
species length weight age
0 Esox lucius 120 8.1 3
3 Esox lucius 120 7.7 2
4 Esox lucius 110 7.9 2
5 Cyprinus carpio 56 8.3 13
6 Cyprinus carpio 36 7.9 23
7 Cyprinus carpio 46 8.1 13
9 Salmo trutta 40 7.5 5
10 Salmo trutta 38 7.4 4
11 Oncorhynchus mykiss 42 7.6 5
13 Salmo salar 44 7.7 5
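
One last tip: as you can see above, drop_duplicates() preserves the original index, leaving gaps where rows were removed. If you’d prefer a fresh zero-based index, you can pass ignore_index=True (available in pandas 1.0 and later), or call reset_index(drop=True) on the result. For example, re-running drop_duplicates() on our already de-duped df with ignore_index=True simply renumbers the surviving rows from 0 to 9.

# Return a de-duplicated dataframe with a fresh zero-based index
df_clean = df.drop_duplicates(ignore_index=True)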

Matt Clarke, Saturday, November 12, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.