Remove duplicates

Sometimes duplicates are useful, but often they are not. Duplicates in an ArrayList can be removed. A HashSet that records encountered elements is helpful.

It is often best to remove duplicates as we encounter them. With a Java HashSet, we can record the elements we have already seen.
The removeDuplicates method accepts an ArrayList of Strings. In it, we loop over the argument list. If our HashSet already contains the String, we do nothing.

If the HashSet does not contain the String, we add it to the HashSet (so it is detected next time) and to the result ArrayList. The result retains the ordering of the original, but not its duplicates.

removeDuplicates could easily be adapted to work on Integers or other types. Try changing the String type to Integer.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    static ArrayList<String> removeDuplicates(ArrayList<String> list) {
        // Store unique items in result.
        ArrayList<String> result = new ArrayList<>();
        // Record encountered Strings in HashSet.
        HashSet<String> set = new HashSet<>();
        // Loop over argument list.
        for (String item : list) {
            // If String is not in set, add it to the list and the set.
            if (!set.contains(item)) {
                result.add(item);
                set.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        list.add("dog");
        list.add("cat");
        list.add("bird");
        // Remove duplicates from ArrayList of Strings.
        ArrayList<String> unique = removeDuplicates(list);
        for (String element : unique) {
            System.out.println(element);
        }
    }
}
```

Output:

```
dog
cat
bird
```
HashSet constructor

This approach removes all duplicates, but the ordering of the elements is lost. We pass an ArrayList to the HashSet constructor, and then convert back to an ArrayList.

A HashSet will not retain ordering.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    public static void main(String[] args) {
        // Create ArrayList with one duplicate element.
        ArrayList<String> values = new ArrayList<>();
        values.add("cat");
        values.add("dog");
        values.add("cat");
        values.add("bird");
        // Create HashSet from ArrayList.
        HashSet<String> set = new HashSet<>(values);
        // Create ArrayList from the set.
        ArrayList<String> result = new ArrayList<>(set);
        // The result.
        System.out.println(result.toString());
    }
}
```

Output:

```
[cat, bird, dog]
```
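If ordering must be kept while still using the constructor approach, the standard java.util.LinkedHashSet can stand in for HashSet, since it remembers insertion order. A minimal sketch (the class name OrderedDedupe is ours, not from the original example):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;

public class OrderedDedupe {
    public static void main(String[] args) {
        ArrayList<String> values = new ArrayList<>();
        values.add("cat");
        values.add("dog");
        values.add("cat");
        values.add("bird");
        // LinkedHashSet removes duplicates but keeps first-occurrence order.
        ArrayList<String> result = new ArrayList<>(new LinkedHashSet<>(values));
        System.out.println(result); // [cat, dog, bird]
    }
}
```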
With a nested for-loop, we can check an ArrayList (or other array) for duplicate elements. We only search following (not preceding) elements for an exact match.

hasDuplicates returns true if the ArrayList has a duplicate. Otherwise, no duplicates exist.

This approach is slower than one that uses a HashSet. Its performance is poor on large collections.

Calling hasDuplicates on small lists where no duplicates are likely is often a good optimization. Note that we compare Strings with equals, not ==, which tests reference identity and only happens to work for interned literals.

```java
import java.util.ArrayList;

public class Program {
    public static boolean hasDuplicates(ArrayList<String> list) {
        for (int i = 0; i < list.size(); i++) {
            // Loop over all following elements.
            for (int x = i + 1; x < list.size(); x++) {
                // If two elements are equal, there is a duplicate.
                if (list.get(i).equals(list.get(x))) {
                    return true;
                }
            }
        }
        // No duplicates found.
        return false;
    }

    public static void main(String[] args) {
        // This ArrayList has a duplicate.
        ArrayList<String> values = new ArrayList<>();
        values.add("green");
        values.add("blue");
        values.add("red");
        values.add("pink");
        values.add("pink");
        // See if duplicates exist.
        if (hasDuplicates(values)) {
            System.out.println(true);
        }
        // Remove all pink strings.
        values.removeIf(element -> element.equals("pink"));
        // No more duplicates.
        System.out.println(hasDuplicates(values));
    }
}
```

Output:

```
true
false
```
Here is an optimization. If an ArrayList rarely contains duplicates, and has few elements, we can avoid modifying it. We can just test for duplicates.

hasDuplicates scans the ArrayList to see if any duplicates exist. It does not try to remove the duplicates. On a small ArrayList, calling hasDuplicates is much faster than removeDuplicates.

If duplicates are common, hasDuplicates() will not help performance. And it will be slow on long lists due to the nested looping.

```java
import java.util.ArrayList;
import java.util.HashSet;

public class Program {
    static boolean hasDuplicates(ArrayList<String> list) {
        // See if duplicates exist.
        for (int i = 0; i < list.size(); i++) {
            for (int x = i + 1; x < list.size(); x++) {
                if (list.get(i).equals(list.get(x))) {
                    return true;
                }
            }
        }
        return false;
    }

    static ArrayList<String> removeDuplicates(ArrayList<String> list) {
        // Remove duplicates: place unique elements in a new list (see above example).
        ArrayList<String> result = new ArrayList<>();
        HashSet<String> set = new HashSet<>();
        for (String item : list) {
            if (!set.contains(item)) {
                result.add(item);
                set.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> elements = new ArrayList<>();
        elements.add("one");
        elements.add("two");
        elements.add("three");
        elements.add("four");

        long t1 = System.currentTimeMillis();
        // Version 1: test to see if duplicates exist.
        for (int i = 0; i < 10000000; i++) {
            if (hasDuplicates(elements)) {
                System.out.println(false);
            }
        }
        long t2 = System.currentTimeMillis();
        // Version 2: process list even if no duplicates.
        for (int i = 0; i < 10000000; i++) {
            ArrayList<String> copy = removeDuplicates(elements);
            if (copy == null) {
                System.out.println(false);
            }
        }
        long t3 = System.currentTimeMillis();
        // ... Benchmark results.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}
```

Results:

```
 128 ms: hasDuplicates()
1409 ms: removeDuplicates()
```
For small lists, simply looping over elements already added, and not adding duplicate ones, is faster. But this approach would lead to problems with large collections.
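That small-list approach might be sketched as follows (dedupeSmall is a hypothetical helper name, not from the examples above); each contains() call scans the result list built so far, so the total cost grows quadratically with list size:

```java
import java.util.ArrayList;

public class SmallListDedupe {
    // Hypothetical helper: add each element only if not already added.
    static ArrayList<String> dedupeSmall(ArrayList<String> list) {
        ArrayList<String> result = new ArrayList<>();
        for (String item : list) {
            // contains() is a linear scan: acceptable for small lists only.
            if (!result.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        System.out.println(dedupeSmall(list)); // [dog, cat]
    }
}
```

This avoids the separate HashSet entirely, trading speed on large inputs for simplicity on small ones.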
With set-based methods, we can remove duplicate Strings from an ArrayList. Other approaches, which involve nested loops to test elements, are superior in some cases.
For duplicate removal ("dedupe") there are many possible options. This is a good thing. It means we can select the one that is best for our specific use case.
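One more such option, not covered above: Java 8 streams provide distinct(), which removes duplicates by equals() while preserving first-occurrence order. A brief sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class StreamDedupe {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.add("dog");
        list.add("cat");
        list.add("dog");
        list.add("bird");
        // distinct() keeps the first occurrence of each element.
        List<String> unique = list.stream().distinct().collect(Collectors.toList());
        System.out.println(unique); // [dog, cat, bird]
    }
}
```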