Remove duplicates

A Python list contains duplicate elements. How can we remove them? Some approaches may lead to the elements becoming reordered.

It is possible to use the set built-in. This code is simpler, but it does not preserve the order of the elements. To preserve order, we use a for-loop that checks a set of already-seen elements and appends each new element to an output list.

Set built-in

Here we use the set() built-in to remove duplicates. This code is simpler, but a set has no defined order, so the elements may come back in a different order.

Part 1 We convert the string list into a set. A set can contain only unique elements, so this removes duplicates.

Part 2 We convert the set back into a list. The original order is not kept across these two conversions. A set is unordered, and the result is not sorted.

Important In the output shown, "fish" is placed before "bird". The order can differ between runs, because string hashing is randomized per process by default.

# Our input list.
values = ["bird", "bird", "fish"]

# Part 1: convert to a set.
# ... Do not name the variable "set", as that would shadow the built-in.
unique = set(values)

# Part 2: convert the set back into a list.
result = list(unique)
print(result)
['fish', 'bird']
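
When a repeatable result matters more than the original order, one option is to sort the unique elements. This is a minimal sketch, assuming the elements can be compared with each other (as strings can here).

```python
values = ["bird", "bird", "fish"]

# A set has no defined order, so sort the unique elements
# ... to get the same list on every run.
result = sorted(set(values))
print(result)
# ['bird', 'fish']
```

The sorted() call returns a new list, so no separate list() conversion is needed.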

Def example

We introduce the remove_duplicates function. It receives a list and loops over its values. It maintains two collections: an output list and a set.

Info The set, seen, tracks which elements have already been encountered. Sets have only unique elements.

So We append elements to our list that have not been seen yet. This means all duplicates are removed, but ordering is not affected.

def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# Remove duplicates from this list.
values = [5, 5, 1, 1, 2]
result = remove_duplicates(values)
print(result)
[5, 1, 2]
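
A shorter alternative to the loop above relies on dictionary keys, which are unique and keep insertion order in Python 3.7 and later. The name remove_duplicates_dict is only illustrative, and the sketch assumes all elements are hashable.

```python
def remove_duplicates_dict(values):
    # Dictionary keys are unique and keep insertion order,
    # ... so the first occurrence of each element is retained.
    return list(dict.fromkeys(values))

# Remove duplicates from this list.
values = [5, 5, 1, 1, 2]
result = remove_duplicates_dict(values)
print(result)
# [5, 1, 2]
```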

Has duplicates

This code checks a list for duplicates, but does not remove anything. It uses a nested loop. The inner for-loop only checks the following elements, not the preceding ones.

Note The number of comparisons grows with the square of the list length, so this could be a performance disaster on extremely large collections. But for small lists, it is effective.

Note 2 For large collections, using a set or dictionary to check for duplicates is a better option.

def has_duplicates(values):
    # For each element, check all following elements for a duplicate.
    for i in range(0, len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False

# Test the has_duplicates function.
print(has_duplicates([10, 20, 30, 40]))
print(has_duplicates([1, 2, 3, 1, 2]))
print(has_duplicates([40, 30, 20, 40]))
print(has_duplicates(["cat", "dog", "bird", "dog"]))
print(has_duplicates([None, 0, 1, 2]))
False
True
True
True
False
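
For larger collections, the set-based check mentioned in the notes above could look like the following sketch. The name has_duplicates_set is hypothetical, and the code assumes every element is hashable. It makes a single pass over the list and stops at the first repeated element.

```python
def has_duplicates_set(values):
    # Track the elements encountered so far.
    seen = set()
    for value in values:
        # A value already in the set is a duplicate.
        if value in seen:
            return True
        seen.add(value)
    return False

# Test with the same lists as before.
print(has_duplicates_set([10, 20, 30, 40]))
print(has_duplicates_set([1, 2, 3, 1, 2]))
print(has_duplicates_set(["cat", "dog", "bird", "dog"]))
# False
# True
# True
```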

Benchmark, has duplicates

Should we ever check before removing duplicates? On a six-element list, we test the performance of has_duplicates.

Version 1 This version of the code checks to see if duplicates exist in the list before removing them.

Version 2 Here we just remove duplicates immediately, without checking to see if any duplicates exist.

Result For a six-element list with no duplicates, using nested for-loops to check was faster than using the set built-in.

However This test assumes no duplicates are ever found. This is a worthwhile optimization if duplicates are rare and lists are small.

import time

def has_duplicates(values):
    # Same as above example.
    for i in range(0, len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False

# Contains no duplicates.
elements = [100, 200, 300, 400, 500, 600]

print(time.time())

# Version 1: test before using set.
for i in range(0, 10000000):
    if has_duplicates(elements):
        unique = list(set(elements))
    else:
        unique = elements

print(time.time())

# Version 2: always use set.
for i in range(0, 10000000):
    unique = list(set(elements))

print(time.time())
1420055657.667
1420055657.838    has_duplicates: 0.171 s  [PyPy]
1420055658.061    set, list:      0.223 s
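
The same comparison can also be run with the timeit module from the standard library, which is designed for timing small pieces of code. This is only a sketch: the helper names check_first and always_set and the smaller run count are assumptions, and the timings will vary by machine and Python implementation.

```python
import timeit

def has_duplicates(values):
    # Same as above example.
    for i in range(0, len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False

# Contains no duplicates.
elements = [100, 200, 300, 400, 500, 600]

def check_first():
    # Version 1: test before using set.
    if has_duplicates(elements):
        return list(set(elements))
    return elements

def always_set():
    # Version 2: always use set.
    return list(set(elements))

# Fewer runs than the original benchmark, to keep the example quick.
runs = 100000
time_check = timeit.timeit(check_first, number=runs)
time_set = timeit.timeit(always_set, number=runs)
print(f"has_duplicates: {time_check:.3f} s")
print(f"set, list:      {time_set:.3f} s")
```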

Some notes

Programs often can do the same thing in different ways. Some approaches may be more efficient. Others may be more concise or easy to read.

A review

We removed duplicates from lists. The set built-in is the simplest approach, but a custom function may be needed to retain the ordering of elements.