Remove duplicates

Sometimes duplicates are useful, but often they are not. Duplicates in an ArrayList can be removed. A HashSet that records encountered elements is helpful.

It is often best to remove duplicates as we encounter them. With a Java HashSet, we can record the elements we have already seen.
The removeDuplicates method accepts an ArrayList of Strings. In it, we loop over the argument list. If our HashSet already contains the String, we do nothing.

If the HashSet does not contain the String, we add it to the HashSet (so it is detected next time) and to the result ArrayList. The result retains the ordering of the original, but not its duplicates.

removeDuplicates could easily be adapted to work on Integers or other types. Try changing the String type to Integer.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    static ArrayList<String> removeDuplicates(ArrayList<String> list) {
        // Store unique items in result.
        ArrayList<String> result = new ArrayList<>();
        // Record encountered Strings in HashSet.
        HashSet<String> set = new HashSet<>();
        // Loop over argument list.
        for (String item : list) {
            // If String is not in set, add it to the list and the set.
            if (!set.contains(item)) {
                result.add(item);
                set.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        list.add("dog");
        list.add("cat");
        list.add("bird");
        // Remove duplicates from ArrayList of Strings.
        ArrayList<String> unique = removeDuplicates(list);
        for (String element : unique) {
            System.out.println(element);
        }
    }
}
```

Output:

```
dog
cat
bird
```
HashSet constructor

This approach removes all duplicates, but the ordering of the elements is lost. We pass an ArrayList to the HashSet constructor, and then convert back to an ArrayList.

A HashSet will not retain ordering.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    public static void main(String[] args) {
        // Create ArrayList with one duplicate element.
        ArrayList<String> values = new ArrayList<>();
        values.add("cat");
        values.add("dog");
        values.add("cat");
        values.add("bird");
        // Create HashSet from ArrayList.
        HashSet<String> set = new HashSet<>(values);
        // Create ArrayList from the set.
        ArrayList<String> result = new ArrayList<>(set);
        // The result.
        System.out.println(result.toString());
    }
}
```

Output:

```
[cat, bird, dog]
```
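If ordering must be kept while still using the constructor approach, the standard java.util.LinkedHashSet can stand in for HashSet, since it remembers insertion order. A minimal sketch (the class name OrderedDedupe is ours, not from the original example):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;

public class OrderedDedupe {
    public static void main(String[] args) {
        ArrayList<String> values = new ArrayList<>();
        values.add("cat");
        values.add("dog");
        values.add("cat");
        values.add("bird");
        // LinkedHashSet removes duplicates but keeps first-occurrence order.
        ArrayList<String> result = new ArrayList<>(new LinkedHashSet<>(values));
        System.out.println(result); // [cat, dog, bird]
    }
}
```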
With a nested for-loop, we can check an ArrayList (or other array) for duplicate elements. We only search following (not preceding) elements for an exact match.

hasDuplicates returns true if the ArrayList has a duplicate. Otherwise, no duplicates exist.

This approach is slower than one that uses a HashSet. Its performance is poor on large collections.

Calling hasDuplicates on small lists where no duplicates are likely is often a good optimization. Note that we compare Strings with equals, not ==, which tests reference identity and only happens to work for interned literals.

```java
import java.util.ArrayList;

public class Program {
    public static boolean hasDuplicates(ArrayList<String> list) {
        for (int i = 0; i < list.size(); i++) {
            // Loop over all following elements.
            for (int x = i + 1; x < list.size(); x++) {
                // If two elements are equal, there is a duplicate.
                if (list.get(i).equals(list.get(x))) {
                    return true;
                }
            }
        }
        // No duplicates found.
        return false;
    }

    public static void main(String[] args) {
        // This ArrayList has a duplicate.
        ArrayList<String> values = new ArrayList<>();
        values.add("green");
        values.add("blue");
        values.add("red");
        values.add("pink");
        values.add("pink");
        // See if duplicates exist.
        if (hasDuplicates(values)) {
            System.out.println(true);
        }
        // Remove all pink strings.
        values.removeIf(element -> element.equals("pink"));
        // No more duplicates.
        System.out.println(hasDuplicates(values));
    }
}
```

Output:

```
true
false
```
Here is an optimization. If an ArrayList rarely contains duplicates, and has few elements, we can avoid modifying it. We can just test for duplicates.

hasDuplicates scans the ArrayList to see if any duplicates exist. It does not try to remove the duplicates. On a small ArrayList, calling hasDuplicates is much faster than removeDuplicates.

If duplicates are common, hasDuplicates() will not help performance. And it will be slow on long lists due to the nested looping.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    static boolean hasDuplicates(ArrayList<String> list) {
        // See if duplicates exist.
        for (int i = 0; i < list.size(); i++) {
            for (int x = i + 1; x < list.size(); x++) {
                if (list.get(i).equals(list.get(x))) {
                    return true;
                }
            }
        }
        return false;
    }

    static ArrayList<String> removeDuplicates(ArrayList<String> list) {
        // Remove duplicates: place unique elements in a new list (see above example).
        ArrayList<String> result = new ArrayList<>();
        HashSet<String> set = new HashSet<>();
        for (String item : list) {
            if (!set.contains(item)) {
                result.add(item);
                set.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> elements = new ArrayList<>();
        elements.add("one");
        elements.add("two");
        elements.add("three");
        elements.add("four");

        long t1 = System.currentTimeMillis();
        // Version 1: test to see if duplicates exist.
        for (int i = 0; i < 10000000; i++) {
            if (hasDuplicates(elements)) {
                System.out.println(false);
            }
        }
        long t2 = System.currentTimeMillis();
        // Version 2: process list even if no duplicates.
        for (int i = 0; i < 10000000; i++) {
            ArrayList<String> copy = removeDuplicates(elements);
            if (copy == null) {
                System.out.println(false);
            }
        }
        long t3 = System.currentTimeMillis();
        // ... Benchmark results.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}
```

Results:

```
 128 ms: hasDuplicates()
1409 ms: removeDuplicates()
```
For small lists, simply looping over elements already added, and not adding duplicate ones, is faster. But this approach would lead to problems with large collections.
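That small-list approach might be sketched as follows (dedupeSmall is a hypothetical helper name, not from the examples above); each contains() call scans the result list built so far, so the total cost grows quadratically with list size:

```java
import java.util.ArrayList;

public class SmallListDedupe {
    // Hypothetical helper: add each element only if not already added.
    static ArrayList<String> dedupeSmall(ArrayList<String> list) {
        ArrayList<String> result = new ArrayList<>();
        for (String item : list) {
            // contains() is a linear scan: acceptable for small lists only.
            if (!result.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        System.out.println(dedupeSmall(list)); // [dog, cat]
    }
}
```

This avoids the separate HashSet entirely, trading speed on large inputs for simplicity on small ones.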
With set-based methods, we can remove duplicate Strings from an ArrayList. Other approaches, which involve nested loops to test elements, are superior in some cases.
For duplicate removal ("dedupe") there are many possible options. This is a good thing. It means we can select the one that is best for our specific use case.
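One more such option, not covered above: Java 8 streams provide distinct(), which removes duplicates by equals() while preserving first-occurrence order. A brief sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class StreamDedupe {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        list.add("bird");
        // distinct() keeps the first occurrence of each element.
        List<String> unique = list.stream().distinct().collect(Collectors.toList());
        System.out.println(unique); // [dog, cat, bird]
    }
}
```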