Remove duplicates
A Python list contains duplicate elements. How can we remove them? Some approaches may lead to the elements becoming reordered.
It is possible to use the set built-in. This code is simpler, but it may change the order of the elements. To preserve order, we use a for-loop that checks a set and appends each unseen element to a new list.
Here we use the set() built-in to remove duplicates. This code is simpler, but it may reorder elements in some Python implementations.
We convert the string list into a set, and sets must have only unique elements. So this removes duplicates. We then convert the set back into a list. Note that set() can reorder elements.

    # Our input list.
    values = ["bird", "bird", "fish"]

    # Part 1: convert to a set.
    # The name "unique" avoids shadowing the set built-in.
    unique = set(values)

    # Part 2: convert the set back into a list.
    result = list(unique)
    print(result)

    ['fish', 'bird']
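When the order of the result does not matter but a repeatable output is preferred, one option is to sort the set. This is a minimal sketch, assuming the elements can be compared with each other. The name unique_sorted is hypothetical and is not part of the example above.

```python
# Hypothetical helper: remove duplicates and return a sorted list.
def unique_sorted(values):
    # set() drops duplicates; sorted() builds a new list in ascending order.
    return sorted(set(values))

values = ["bird", "bird", "fish"]
print(unique_sorted(values))
```

This prints ['bird', 'fish'] on every run, because the final order comes from sorting and not from the set.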
We introduce the remove_duplicates method. It receives a list and loops over its values. It maintains 2 collections: an output list and a set that records the values already seen.
    def remove_duplicates(values):
        output = []
        seen = set()
        for value in values:
            # If value has not been encountered yet,
            # ... add it to both list and set.
            if value not in seen:
                output.append(value)
                seen.add(value)
        return output

    # Remove duplicates from this list.
    values = [5, 5, 1, 1, 2]
    result = remove_duplicates(values)
    print(result)

    [5, 1, 2]
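The same order-preserving result can also be reached without an explicit loop. The following is a minimal sketch, assuming Python 3.7 or later (where dictionaries keep insertion order) and hashable elements. The name remove_duplicates_dict is hypothetical.

```python
# Hypothetical alternative to remove_duplicates, assuming Python 3.7+.
def remove_duplicates_dict(values):
    # dict.fromkeys() creates one key per distinct value.
    # Keys stay in first-seen order, so list() returns them in that order.
    return list(dict.fromkeys(values))

values = [5, 5, 1, 1, 2]
print(remove_duplicates_dict(values))
```

This prints [5, 1, 2], matching the loop-based version. The explicit loop may still be easier to adapt when extra logic is needed for each element.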
This code checks a list for duplicates, but does not remove anything. It uses a nested loop. The inner for-loop only checks the following elements, not the preceding ones, because each earlier pair has already been compared.
    def has_duplicates(values):
        # For each element, check all following elements for a duplicate.
        for i in range(0, len(values)):
            for x in range(i + 1, len(values)):
                if values[i] == values[x]:
                    return True
        return False

    # Test the has_duplicates method.
    print(has_duplicates([10, 20, 30, 40]))
    print(has_duplicates([1, 2, 3, 1, 2]))
    print(has_duplicates([40, 30, 20, 40]))
    print(has_duplicates(["cat", "dog", "bird", "dog"]))
    print(has_duplicates([None, 0, 1, 2]))

    False
    True
    True
    True
    False
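For comparison, the same check can be written by comparing lengths. This is a minimal sketch, assuming every element is hashable. The name has_duplicates_set is hypothetical.

```python
# Hypothetical set-based variant of has_duplicates (hashable elements only).
def has_duplicates_set(values):
    # A set keeps one copy of each value, so it is shorter than
    # the list exactly when the list held a repeated value.
    return len(set(values)) != len(values)

print(has_duplicates_set([10, 20, 30, 40]))
print(has_duplicates_set([1, 2, 3, 1, 2]))
print(has_duplicates_set(["cat", "dog", "bird", "dog"]))
```

This prints False, True and True. Unlike the nested loop, this version always builds a full set and cannot return early at the first duplicate.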
Should we ever check before removing duplicates? On a six-element list, we test the performance of has_duplicates. In this test, where the list contains no duplicates, using nested for-loops to check was faster than using the set built-in.

    import time

    def has_duplicates(values):
        # Same as above example.
        for i in range(0, len(values)):
            for x in range(i + 1, len(values)):
                if values[i] == values[x]:
                    return True
        return False

    # Contains no duplicates.
    elements = [100, 200, 300, 400, 500, 600]

    print(time.time())

    # Version 1: test before using set.
    for i in range(0, 10000000):
        if has_duplicates(elements):
            unique = list(set(elements))
        else:
            unique = elements

    print(time.time())

    # Version 2: always use set.
    for i in range(0, 10000000):
        unique = list(set(elements))

    print(time.time())

    1420055657.667
    1420055657.838    has_duplicates: 0.171 s [PyPy]
    1420055658.061    set, list:      0.223 s
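The standard timeit module offers another way to run this kind of comparison. The following is a minimal sketch, not the benchmark used above: the helper names check_then_set and always_set are hypothetical, and the iteration count is reduced so that it finishes quickly. Timings depend on the interpreter and hardware, so no particular result is implied.

```python
import timeit

def has_duplicates(values):
    # Nested-loop check: compare each element with the ones after it.
    for i in range(len(values)):
        for x in range(i + 1, len(values)):
            if values[i] == values[x]:
                return True
    return False

# Hypothetical helper: only build a set when a duplicate is found.
def check_then_set(values):
    if has_duplicates(values):
        return list(set(values))
    return values

# Hypothetical helper: always go through a set.
def always_set(values):
    return list(set(values))

# Contains no duplicates.
elements = [100, 200, 300, 400, 500, 600]

# timeit.timeit() calls the function "number" times and returns total seconds.
t1 = timeit.timeit(lambda: check_then_set(elements), number=100000)
t2 = timeit.timeit(lambda: always_set(elements), number=100000)

print(f"check first: {t1:.3f} s")
print(f"always set:  {t2:.3f} s")
```

Running each version several times, and on lists of different sizes, gives a more reliable picture than a single measurement. The nested loop does more work as the list grows.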
Programs often can do the same thing in different ways. Some approaches may be more efficient. Others may be more concise or easy to read.
We removed duplicates from lists. When the ordering of elements must be retained, a custom method may be needed.