How to dedupe lists in Python with set() and intersection()

Picture by Czapp Árpád, Pexels.


When working with Python lists you’ll often need to remove duplicate values from a single list, remove values that are duplicated across multiple lists, or identify which values are duplicated or unique in one or more lists.

Python includes a number of features that make deduplicating list values fairly simple. In this quick tutorial I’ll show you how you can use three Python features, dict.fromkeys(), set(), and the set method intersection(), to identify and remove duplicate or unique values in Python lists.

Create some lists containing duplicate values

First, let’s create a couple of lists that contain a few duplicate values, and some values that are unique to each list. As you can see, Macan and Cayenne are found in both lists, but Range Rover is only found in the first and G Wagen and Defender are only found in the second.

first = ['Macan', 'Cayenne', 'Range Rover']
second = ['Macan', 'Cayenne', 'G Wagen', 'Defender']

Identify values found in the first list only using set()

To identify values only found in the first list we can use Python’s built-in set() type. We’ll convert both lists to sets with set(), subtract the second set from the first using the - operator (a set difference), and then cast the result back to a list using list().

found_in_first_only = list(set(first) - set(second))

If you print found_in_first_only you’ll get back a list of the values that are unique to the first list.

found_in_first_only

['Range Rover']
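As a small sketch of the same step, the - operator can also be written as the difference() method, and wrapping the result in sorted() instead of list() gives a stable order. The use of sorted() here is my own suggestion rather than part of the original approach.

```python
first = ['Macan', 'Cayenne', 'Range Rover']
second = ['Macan', 'Cayenne', 'G Wagen', 'Defender']

# The - operator and the difference() method produce the same set.
assert set(first) - set(second) == set(first).difference(second)

# Sets are unordered, so sort the result when a predictable order matters.
found_in_first_only = sorted(set(first) - set(second))
print(found_in_first_only)  # ['Range Rover']
```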

Identify values found in the second list only using set()

To identify values only found in the second list we can use set() again but instead subtract set(first) from set(second) to get the values that are unique to the second list only.

found_in_second_only = list(set(second) - set(first))

Printing found_in_second_only reveals that the Defender and G Wagen values are only found in the second list and not the first list. Sets are unordered, so the order of the values in the output may differ from what’s shown here.

found_in_second_only

['Defender', 'G Wagen']
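If you need the result in the same order as the original list, one possible alternative is a list comprehension that checks each value against a set. This is a sketch of a different technique from the set subtraction above, and the name first_lookup is just an illustrative choice.

```python
first = ['Macan', 'Cayenne', 'Range Rover']
second = ['Macan', 'Cayenne', 'G Wagen', 'Defender']

# Converting first to a set makes each membership test fast.
first_lookup = set(first)

# Keep values from second that are absent from first, in their original order.
# Unlike set subtraction, this would keep any repeats within second.
found_in_second_only = [car for car in second if car not in first_lookup]
print(found_in_second_only)  # ['G Wagen', 'Defender']
```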

Identify items duplicated in both lists using intersection()

To identify items that are duplicated and found in both the first and second lists we can use the set method intersection(). Here we’ll call .intersection(set(second)) on set(first) and then cast the output to a list using list().

found_in_both = list(set(first).intersection(set(second)))

Printing found_in_both shows us that the values Cayenne and Macan were duplicated and present in both the first and second lists.

found_in_both

['Cayenne', 'Macan']
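For reference, here’s a short sketch of two variations on this step. The & operator is equivalent to intersection() when both sides are sets, and intersection() accepts several iterables at once. The third list is hypothetical and only there to illustrate the idea.

```python
first = ['Macan', 'Cayenne', 'Range Rover']
second = ['Macan', 'Cayenne', 'G Wagen', 'Defender']
third = ['Cayenne', 'Defender', 'Urus']  # hypothetical third list for illustration

# intersection() accepts any iterable, so wrapping second in set() is optional.
found_in_both = sorted(set(first).intersection(second))
print(found_in_both)  # ['Cayenne', 'Macan']

# The & operator gives the same result, but both operands must be sets.
assert set(first) & set(second) == set(found_in_both)

# Passing several iterables finds the values common to all of them.
found_in_all = sorted(set(first).intersection(second, third))
print(found_in_all)  # ['Cayenne']
```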

Identify unique values found in either list

Another common problem you’ll encounter is removing duplicate values from a single Python list. This is also pretty easy. First, we’ll use extend() to append the values of the second list to the first, giving one continuous list containing some duplicate values. Note that extend() modifies the first list in place.

first.extend(second)

Next we’ll pass the list to dict.fromkeys(), cast the output to a list using list(), and assign it to deduped. Dictionary keys must be unique and, from Python 3.7 onwards, keep their insertion order, so this returns a list containing only the unique values, in the order they first appeared.

deduped = list(dict.fromkeys(first))
deduped

['Macan', 'Cayenne', 'Range Rover', 'G Wagen', 'Defender']
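If you’d rather leave the original lists untouched, a minimal sketch is to build a new list with the + operator instead of calling extend(). The combined variable name is an assumption of mine.

```python
first = ['Macan', 'Cayenne', 'Range Rover']
second = ['Macan', 'Cayenne', 'G Wagen', 'Defender']

# The + operator returns a new list, so first and second are not modified.
combined = first + second

# dict.fromkeys() keeps the first occurrence of each value, in order.
deduped = list(dict.fromkeys(combined))
print(deduped)  # ['Macan', 'Cayenne', 'Range Rover', 'G Wagen', 'Defender']
print(first)    # ['Macan', 'Cayenne', 'Range Rover']
```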

Remove duplicates from a list using dict.fromkeys()

Here’s another example of that in action, this time deduping a single list that contains duplicate values using dict.fromkeys(). It’s a quick and easy way to dedupe a Python list and find the unique values while keeping their original order.

cars = ['Maserati', 'Ferrari', 'Porsche', 'Gilbern', 'Bitter', 'Bitter', 'Lotus', 'Lotus']
deduped = list(dict.fromkeys(cars))
deduped

['Maserati', 'Ferrari', 'Porsche', 'Gilbern', 'Bitter', 'Lotus']
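If you want to know which values were duplicated rather than just removing them, one option is collections.Counter from the standard library. This is a sketch of a technique that isn’t covered above, shown on the same cars list.

```python
from collections import Counter

cars = ['Maserati', 'Ferrari', 'Porsche', 'Gilbern', 'Bitter', 'Bitter', 'Lotus', 'Lotus']

# Count how many times each value appears in the list.
counts = Counter(cars)

# Keep only the values that appear more than once, in first-seen order.
duplicated = [car for car, count in counts.items() if count > 1]
print(duplicated)  # ['Bitter', 'Lotus']
```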

Matt Clarke, Saturday, April 23, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.