Duplicate values are a common occurrence in data science, and they come in various forms. Not only will you need to be able to identify duplicate values, but you will also need to be able to remove them from your data using a process known as de-duplication or de-duping.
There are three main types of data duplication:
In this post, you will learn how to identify duplicate values using the duplicated()
method and how to remove them using the drop_duplicates()
method. We’ll handle everything from rows that are completely duplicated (exact duplicates), to rows that include duplicate values in just one column (duplicate keys), and those that include duplicate values in multiple columns (partial duplicates).
To get started, you will need to open a new Jupyter Notebook and import the pandas
package. You will also need to import the pandas
package as pd
to make it easier to reference later on.
import pandas as pd
Next, you can either import your data into Pandas or you can create a new dataframe containing some duplicate values. The example here creates various types of duplicate values.
# Create a dataframe
data = [{'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
{'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
{'species': 'Esox lucius', 'length': 120, 'weight': 8.1, 'age': 3},
{'species': 'Esox lucius', 'length': 120, 'weight': 7.7, 'age': 2},
{'species': 'Esox lucius', 'length': 110, 'weight': 7.9, 'age': 2},
{'species': 'Cyprinus carpio', 'length': 56, 'weight': 8.3, 'age': 13},
{'species': 'Cyprinus carpio', 'length': 36, 'weight': 7.9, 'age': 23},
{'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
{'species': 'Cyprinus carpio', 'length': 46, 'weight': 8.1, 'age': 13},
{'species': 'Salmo trutta', 'length': 40, 'weight': 7.5, 'age': 5},
{'species': 'Salmo trutta', 'length': 38, 'weight': 7.4, 'age': 4},
{'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
{'species': 'Oncorhynchus mykiss', 'length': 42, 'weight': 7.6, 'age': 5},
{'species': 'Salmo salar', 'length': 44, 'weight': 7.7, 'age': 5}]
df = pd.DataFrame(data)
df
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
1 | Esox lucius | 120 | 8.1 | 3 |
2 | Esox lucius | 120 | 8.1 | 3 |
3 | Esox lucius | 120 | 7.7 | 2 |
4 | Esox lucius | 110 | 7.9 | 2 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
6 | Cyprinus carpio | 36 | 7.9 | 23 |
7 | Cyprinus carpio | 46 | 8.1 | 13 |
8 | Cyprinus carpio | 46 | 8.1 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
10 | Salmo trutta | 38 | 7.4 | 4 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
First, we’ll look at the duplicated()
method. This method returns a boolean series indicating whether a row is a duplicate. The default behavior is to return True
if the row is a duplicate of a previous row. Note that this just returns a series by default, with the numbers of the rows as the index.
By default, duplicated()
considers the entire row to be a duplicate if all the values in the row are the same. It also considers the first row to be unique, so the first row will always be False
, since it doesn’t become a duplicate until the next occurrence is encountered.
df.duplicated()
0 False
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 True
13 False
dtype: bool
By default, the duplicated()
method returns a boolean series. This isn’t always very useful. Instead, you may wish to add a new column to the dataframe that states whether the row is a duplicate.
df2 = df.copy()
df2['duplicated'] = df2.duplicated()
df2
species | length | weight | age | duplicated | |
---|---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 | False |
1 | Esox lucius | 120 | 8.1 | 3 | True |
2 | Esox lucius | 120 | 8.1 | 3 | True |
3 | Esox lucius | 120 | 7.7 | 2 | False |
4 | Esox lucius | 110 | 7.9 | 2 | False |
5 | Cyprinus carpio | 56 | 8.3 | 13 | False |
6 | Cyprinus carpio | 36 | 7.9 | 23 | False |
7 | Cyprinus carpio | 46 | 8.1 | 13 | False |
8 | Cyprinus carpio | 46 | 8.1 | 13 | True |
9 | Salmo trutta | 40 | 7.5 | 5 | False |
10 | Salmo trutta | 38 | 7.4 | 4 | False |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 | False |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 | True |
13 | Salmo salar | 44 | 7.7 | 5 | False |
By default, duplicated()
considers the first row to be unique, so the first row will always be False
, since it doesn’t become a duplicate until the next occurrence is encountered. This is the same as setting the keep
parameter to 'first'
, so the code here will do exactly the same thing, but it is a bit clearer to be explicit about it.
df2['duplicated_first'] = df2.duplicated(keep='first')
df2
species | length | weight | age | duplicated | duplicated_first | |
---|---|---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 | False | False |
1 | Esox lucius | 120 | 8.1 | 3 | True | False |
2 | Esox lucius | 120 | 8.1 | 3 | True | True |
3 | Esox lucius | 120 | 7.7 | 2 | False | False |
4 | Esox lucius | 110 | 7.9 | 2 | False | False |
5 | Cyprinus carpio | 56 | 8.3 | 13 | False | False |
6 | Cyprinus carpio | 36 | 7.9 | 23 | False | False |
7 | Cyprinus carpio | 46 | 8.1 | 13 | False | False |
8 | Cyprinus carpio | 46 | 8.1 | 13 | True | False |
9 | Salmo trutta | 40 | 7.5 | 5 | False | False |
10 | Salmo trutta | 38 | 7.4 | 4 | False | False |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 | False | False |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 | True | False |
13 | Salmo salar | 44 | 7.7 | 5 | False | False |
If instead you wish to keep the last occurrence of a duplicate, you can set the keep
parameter to 'last'
. This will mark the last occurrence of a duplicate as False
, and all the others as True
.
df2['duplicated_last'] = df2.duplicated(keep='last')
df2
species | length | weight | age | duplicated | duplicated_first | duplicated_last | |
---|---|---|---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 | False | False | False |
1 | Esox lucius | 120 | 8.1 | 3 | True | False | False |
2 | Esox lucius | 120 | 8.1 | 3 | True | True | False |
3 | Esox lucius | 120 | 7.7 | 2 | False | False | False |
4 | Esox lucius | 110 | 7.9 | 2 | False | False | False |
5 | Cyprinus carpio | 56 | 8.3 | 13 | False | False | False |
6 | Cyprinus carpio | 36 | 7.9 | 23 | False | False | False |
7 | Cyprinus carpio | 46 | 8.1 | 13 | False | False | False |
8 | Cyprinus carpio | 46 | 8.1 | 13 | True | False | False |
9 | Salmo trutta | 40 | 7.5 | 5 | False | False | False |
10 | Salmo trutta | 38 | 7.4 | 4 | False | False | False |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 | False | False | False |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 | True | False | False |
13 | Salmo salar | 44 | 7.7 | 5 | False | False | False |
In the default example, duplicated()
is looking at the entire row to determine if it is a duplicate. If you want to find duplicates based on a single column, you can use the subset
parameter. For example, if you want to find duplicates based on the species
column, you can do the following. You can, of course, also combine this with the keep
parameter to determine which duplicates to keep.
df.duplicated(subset=['species'])
df2.duplicated(subset=['species'])
0 False
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
10 True
11 False
12 True
13 False
dtype: bool
You may also want to find duplicates based on multiple columns. For example, if you want to find duplicates based on the species
and weight
columns, you can do the following. You can also combine this with the keep
parameter to determine which duplicates to keep.
df2.duplicated(subset=['species', 'length'])
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 True
13 False
dtype: bool
If you want to remove the duplicate values from your data, you can use the drop_duplicates()
method. This is known as de-duplicating or de-duping your data. The drop_duplicates()
method works just like the duplicated()
method, except that it returns a new DataFrame with the duplicate rows removed. You can use the keep
parameter to determine which duplicates to keep. The default is to keep the first duplicate and remove the rest. You can also use the subset
parameter to determine which columns to use to determine duplicates.
To refresh our memory on the original dataframe, let’s take a look at it again.
df
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
1 | Esox lucius | 120 | 8.1 | 3 |
2 | Esox lucius | 120 | 8.1 | 3 |
3 | Esox lucius | 120 | 7.7 | 2 |
4 | Esox lucius | 110 | 7.9 | 2 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
6 | Cyprinus carpio | 36 | 7.9 | 23 |
7 | Cyprinus carpio | 46 | 8.1 | 13 |
8 | Cyprinus carpio | 46 | 8.1 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
10 | Salmo trutta | 38 | 7.4 | 4 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
Let’s say we want to remove all duplicate rows from our data. We can do this by using the drop_duplicates()
method without any parameters. This will remove all duplicate rows from our data where the values are the same in all columns. By default, it will keep the first occurrence and remove the rest.
df1 = df.drop_duplicates()
df1
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
3 | Esox lucius | 120 | 7.7 | 2 |
4 | Esox lucius | 110 | 7.9 | 2 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
6 | Cyprinus carpio | 36 | 7.9 | 23 |
7 | Cyprinus carpio | 46 | 8.1 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
10 | Salmo trutta | 38 | 7.4 | 4 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
Next, let’s say we want to remove all duplicate rows from our data where the values are the same in a single column. We can do this by using the drop_duplicates()
method and specifying the subset
parameter. This will remove all duplicate rows from our data where the values are the same in the species
column. By default, it will keep the first occurrence and remove the rest.
df2 = df.drop_duplicates(subset=['species'])
df2
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
If you want to keep the last occurrence instead of the first, you can use the keep
parameter. This will remove all duplicate rows from our data where the values are the same in the species
column. By default, it will keep the first occurrence and remove the rest. By setting the keep
parameter to last
, it will keep the last occurrence and remove the rest.
df2_last = df.drop_duplicates(subset=['species'], keep='last')
df2_last
species | length | weight | age | |
---|---|---|---|---|
4 | Esox lucius | 110 | 7.9 | 2 |
8 | Cyprinus carpio | 46 | 8.1 | 13 |
10 | Salmo trutta | 38 | 7.4 | 4 |
12 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
You’re not limited to de-duping data based on a single column or the entire row, you can also specify multiple columns to use for de-duping. This is done by passing a list of column names to the subset
parameter. This will remove all duplicate rows from our data where the values are the same in the species
and length
columns. By default, it will keep the first occurrence and remove the rest.
df3 = df.drop_duplicates(subset=['species', 'length'])
df3
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
4 | Esox lucius | 110 | 7.9 | 2 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
6 | Cyprinus carpio | 36 | 7.9 | 23 |
7 | Cyprinus carpio | 46 | 8.1 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
10 | Salmo trutta | 38 | 7.4 | 4 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
Finally, there’s the inplace
parameter. This will modify the original DataFrame instead of returning a new one. This is useful if you don’t need the original DataFrame anymore and want to save some memory. By default, it’s set to False
which means it will return a new DataFrame. If you set it to True
, it will modify the original DataFrame.
df.drop_duplicates(inplace=True)
df
species | length | weight | age | |
---|---|---|---|---|
0 | Esox lucius | 120 | 8.1 | 3 |
3 | Esox lucius | 120 | 7.7 | 2 |
4 | Esox lucius | 110 | 7.9 | 2 |
5 | Cyprinus carpio | 56 | 8.3 | 13 |
6 | Cyprinus carpio | 36 | 7.9 | 23 |
7 | Cyprinus carpio | 46 | 8.1 | 13 |
9 | Salmo trutta | 40 | 7.5 | 5 |
10 | Salmo trutta | 38 | 7.4 | 4 |
11 | Oncorhynchus mykiss | 42 | 7.6 | 5 |
13 | Salmo salar | 44 | 7.7 | 5 |
Matt Clarke, Saturday, November 12, 2022