How to identify and remove duplicate values in Pandas

Picture by Vojtech Okenka, Pexels.

22 minutes to read

Data Science Pandas

Duplicate values are a common occurrence in data science, and they come in various forms. Not only will you need to be able to identify duplicate values, but you will also need to be able to remove them from your data using a process known as de-duplication or de-duping.

The different types of data duplication

There are three main types of data duplication:

Exact duplicates are rows that contain the same values in all columns.
Partial duplicates are rows that contain the same values in some columns.
Duplicate keys are rows that contain the same values in one or more columns, but not all columns.

In this post, you will learn how to identify duplicate values using the duplicated() method and how to remove them using the drop_duplicates() method. We’ll handle everything from rows that are completely duplicated (exact duplicates), to rows that include duplicate values in just one column (duplicate keys), and those that include duplicate values in multiple columns (partial duplicates).

Import the packages

To get started, you will need to open a new Jupyter Notebook and import the pandas package. You will also need to import the pandas package as pd to make it easier to reference later on.

import pandas as pd

Create a dataframe containing some duplicate values

Next, you can either import your data into Pandas or you can create a new dataframe containing some duplicate values. The example here creates various types of duplicate values.

# Create a dataframe
data = [{'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3}, 
        {'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
        {'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
        {'species': 'Esox lucius', 'length': 120, 'weight': 7.7, 'age': 2},
        {'species': 'Esox lucius', 'length': 110, 'weight': 7.9, 'age': 2},
        {'species': 'Cyprinus carpio', 'length': 56, 'weight': 8.3, 'age': 13},
        {'species': 'Cyprinus carpio', 'length': 36, 'weight': 7.9, 'age': 23},
        {'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
        {'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
        {'species': 'Salmo trutta', 'length': 40, 'weight': 7.5, 'age': 5},
        {'species': 'Salmo trutta', 'length': 38, 'weight': 7.4, 'age': 4},
        {'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
        {'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
        {'species': 'Salmo salar', 'length': 44, 'weight': 7.7, 'age': 5}]

df = pd.DataFrame(data)
df

	species	length	weight	age
0	Esox lucius	120	8.1	3
1	Esox lucius	120	8.1	3
2	Esox lucius	120	8.1	3
3	Esox lucius	120	7.7	2
4	Esox lucius	110	7.9	2
5	Cyprinus carpio	56	8.3	13
6	Cyprinus carpio	36	7.9	23
7	Cyprinus carpio	46	8.1	13
8	Cyprinus carpio	46	8.1	13
9	Salmo trutta	40	7.5	5
10	Salmo trutta	38	7.4	4
11	Oncorhynchus mykiss	42	7.6	5
12	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Use duplicated() to return a boolean series indicating whether a row is a duplicate

First, we’ll look at the duplicated() method. This method returns a boolean series indicating whether a row is a duplicate. The default behavior is to return True if the row is a duplicate of a previous row. Note that this just returns a series by default, with the numbers of the rows as the index.

By default, duplicated() considers the entire row to be a duplicate if all the values in the row are the same. It also considers the first row to be unique, so the first row will always be False, since it doesn’t become a duplicate until the next occurrence is encountered.

df.duplicated()

   False
    True
    True
   False
   False
   False
   False
   False
    True
   False
  False
  False
   True
  False
dtype: bool

Use duplicated() to add a new column stating whether the row is a duplicate

By default, the duplicated() method returns a boolean series. This isn’t always very useful. Instead, you may wish to add a new column to the dataframe that states whether the row is a duplicate.

df2 = df.copy()
df2['duplicated'] = df2.duplicated()
df2

	species	length	weight	age	duplicated
0	Esox lucius	120	8.1	3	False
1	Esox lucius	120	8.1	3	True
2	Esox lucius	120	8.1	3	True
3	Esox lucius	120	7.7	2	False
4	Esox lucius	110	7.9	2	False
5	Cyprinus carpio	56	8.3	13	False
6	Cyprinus carpio	36	7.9	23	False
7	Cyprinus carpio	46	8.1	13	False
8	Cyprinus carpio	46	8.1	13	True
9	Salmo trutta	40	7.5	5	False
10	Salmo trutta	38	7.4	4	False
11	Oncorhynchus mykiss	42	7.6	5	False
12	Oncorhynchus mykiss	42	7.6	5	True
13	Salmo salar	44	7.7	5	False

Use keep=’first’ to keep the first occurrence of a duplicate

By default, duplicated() considers the first row to be unique, so the first row will always be False, since it doesn’t become a duplicate until the next occurrence is encountered. This is the same as setting the keep parameter to 'first', so the code here will do exactly the same thing, but it is a bit clearer to be explicit about it.

df2['duplicated_first'] = df2.duplicated(keep='first')
df2

	species	length	weight	age	duplicated	duplicated_first
0	Esox lucius	120	8.1	3	False	False
1	Esox lucius	120	8.1	3	True	False
2	Esox lucius	120	8.1	3	True	True
3	Esox lucius	120	7.7	2	False	False
4	Esox lucius	110	7.9	2	False	False
5	Cyprinus carpio	56	8.3	13	False	False
6	Cyprinus carpio	36	7.9	23	False	False
7	Cyprinus carpio	46	8.1	13	False	False
8	Cyprinus carpio	46	8.1	13	True	False
9	Salmo trutta	40	7.5	5	False	False
10	Salmo trutta	38	7.4	4	False	False
11	Oncorhynchus mykiss	42	7.6	5	False	False
12	Oncorhynchus mykiss	42	7.6	5	True	False
13	Salmo salar	44	7.7	5	False	False

Use keep=’last’ to keep the last occurrence of a duplicate

If instead you wish to keep the last occurrence of a duplicate, you can set the keep parameter to 'last'. This will mark the last occurrence of a duplicate as False, and all the others as True.

df2['duplicated_last'] = df2.duplicated(keep='last')
df2

	species	length	weight	age	duplicated	duplicated_first	duplicated_last
0	Esox lucius	120	8.1	3	False	False	False
1	Esox lucius	120	8.1	3	True	False	False
2	Esox lucius	120	8.1	3	True	True	False
3	Esox lucius	120	7.7	2	False	False	False
4	Esox lucius	110	7.9	2	False	False	False
5	Cyprinus carpio	56	8.3	13	False	False	False
6	Cyprinus carpio	36	7.9	23	False	False	False
7	Cyprinus carpio	46	8.1	13	False	False	False
8	Cyprinus carpio	46	8.1	13	True	False	False
9	Salmo trutta	40	7.5	5	False	False	False
10	Salmo trutta	38	7.4	4	False	False	False
11	Oncorhynchus mykiss	42	7.6	5	False	False	False
12	Oncorhynchus mykiss	42	7.6	5	True	False	False
13	Salmo salar	44	7.7	5	False	False	False

Find duplicates based on a single column with subset

In the default example, duplicated() is looking at the entire row to determine if it is a duplicate. If you want to find duplicates based on a single column, you can use the subset parameter. For example, if you want to find duplicates based on the species column, you can do the following. You can, of course, also combine this with the keep parameter to determine which duplicates to keep.

df.duplicated(subset=['species'])

df2.duplicated(subset=['species'])

   False
    True
    True
    True
    True
   False
    True
    True
    True
   False
   True
  False
   True
  False
dtype: bool

Find duplicates based on multiple columns with subset

You may also want to find duplicates based on multiple columns. For example, if you want to find duplicates based on the species and weight columns, you can do the following. You can also combine this with the keep parameter to determine which duplicates to keep.

df2.duplicated(subset=['species', 'length'])

   False
    True
    True
    True
   False
   False
   False
   False
    True
   False
  False
  False
   True
  False
dtype: bool

De-duplicate the data using drop_duplicates()

If you want to remove the duplicate values from your data, you can use the drop_duplicates() method. This is known as de-duplicating or de-duping your data. The drop_duplicates() method works just like the duplicated() method, except that it returns a new DataFrame with the duplicate rows removed. You can use the keep parameter to determine which duplicates to keep. The default is to keep the first duplicate and remove the rest. You can also use the subset parameter to determine which columns to use to determine duplicates.

To refresh our memory on the original dataframe, let’s take a look at it again.

df

	species	length	weight	age
0	Esox lucius	120	8.1	3
1	Esox lucius	120	8.1	3
2	Esox lucius	120	8.1	3
3	Esox lucius	120	7.7	2
4	Esox lucius	110	7.9	2
5	Cyprinus carpio	56	8.3	13
6	Cyprinus carpio	36	7.9	23
7	Cyprinus carpio	46	8.1	13
8	Cyprinus carpio	46	8.1	13
9	Salmo trutta	40	7.5	5
10	Salmo trutta	38	7.4	4
11	Oncorhynchus mykiss	42	7.6	5
12	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Drop duplicate rows with matching values in all columns

Let’s say we want to remove all duplicate rows from our data. We can do this by using the drop_duplicates() method without any parameters. This will remove all duplicate rows from our data where the values are the same in all columns. By default, it will keep the first occurrence and remove the rest.

df1 = df.drop_duplicates()
df1

	species	length	weight	age
0	Esox lucius	120	8.1	3
3	Esox lucius	120	7.7	2
4	Esox lucius	110	7.9	2
5	Cyprinus carpio	56	8.3	13
6	Cyprinus carpio	36	7.9	23
7	Cyprinus carpio	46	8.1	13
9	Salmo trutta	40	7.5	5
10	Salmo trutta	38	7.4	4
11	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Drop duplicate rows with matching values in a single column

Next, let’s say we want to remove all duplicate rows from our data where the values are the same in a single column. We can do this by using the drop_duplicates() method and specifying the subset parameter. This will remove all duplicate rows from our data where the values are the same in the species column. By default, it will keep the first occurrence and remove the rest.

df2 = df.drop_duplicates(subset=['species'])
df2

	species	length	weight	age
0	Esox lucius	120	8.1	3
5	Cyprinus carpio	56	8.3	13
9	Salmo trutta	40	7.5	5
11	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Drop duplicate rows with matching values in a single column and keep the last row

If you want to keep the last occurrence instead of the first, you can use the keep parameter. This will remove all duplicate rows from our data where the values are the same in the species column. By default, it will keep the first occurrence and remove the rest. By setting the keep parameter to last, it will keep the last occurrence and remove the rest.

df2_last = df.drop_duplicates(subset=['species'], keep='last')
df2_last

	species	length	weight	age
4	Esox lucius	110	7.9	2
8	Cyprinus carpio	46	8.1	13
10	Salmo trutta	38	7.4	4
12	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Drop duplicates based on multiple columns with subset

You’re not limited to de-duping data based on a single column or the entire row, you can also specify multiple columns to use for de-duping. This is done by passing a list of column names to the subset parameter. This will remove all duplicate rows from our data where the values are the same in the species and length columns. By default, it will keep the first occurrence and remove the rest.

df3 = df.drop_duplicates(subset=['species', 'length'])
df3

	species	length	weight	age
0	Esox lucius	120	8.1	3
4	Esox lucius	110	7.9	2
5	Cyprinus carpio	56	8.3	13
6	Cyprinus carpio	36	7.9	23
7	Cyprinus carpio	46	8.1	13
9	Salmo trutta	40	7.5	5
10	Salmo trutta	38	7.4	4
11	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Drop duplicates using inplace

Finally, there’s the inplace parameter. This will modify the original DataFrame instead of returning a new one. This is useful if you don’t need the original DataFrame anymore and want to save some memory. By default, it’s set to False which means it will return a new DataFrame. If you set it to True, it will modify the original DataFrame.

df.drop_duplicates(inplace=True)
df

	species	length	weight	age
0	Esox lucius	120	8.1	3
3	Esox lucius	120	7.7	2
4	Esox lucius	110	7.9	2
5	Cyprinus carpio	56	8.3	13
6	Cyprinus carpio	36	7.9	23
7	Cyprinus carpio	46	8.1	13
9	Salmo trutta	40	7.5	5
10	Salmo trutta	38	7.4	4
11	Oncorhynchus mykiss	42	7.6	5
13	Salmo salar	44	7.7	5

Matt Clarke, Saturday, November 12, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.