note:prog:csharp_241206:001:index
C#: Remove Duplicates From a List (2024-12-06)
Local Backup
In this article, we will learn the different methods to remove duplicates from a list in C#. List<T> is a collection having elements of the same type <T> which can contain multiple occurrences of the same item. Sometimes, we want that every element is unique, and here we will check various ways to achieve this goal, comparing their performance.
-
-
Remove Duplicates by Using LINQ Methods
We begin by creating a list containing some duplicates:
List<int> listWithDuplicates = new List<int>() { 1, 2, 3, 4, 1, 2, 3, 4 };
We want to make the listWithDuplicates contain every number only once { 1, 2, 3, 4 }.
Using Distinct()
The first method we use is the LINQ Distinct():
listWithDuplicates.Distinct().ToList();
The Distinct() method does exactly what we need – removes the duplicates. It returns an IEnumerable<int> result and that’s why we need to perform ToList() if we need a list.
Using GroupBy() and Select() Combo
Going forward with LINQ methods, we can also perform:
listWithDuplicates.GroupBy(x => x).Select(d => d.First()).ToList();
In this case, we first group equal elements in listWithDuplicates using the GroupBy() method. For each group of equal elements, we return the first one with Select(). Since this operation returns an IEnumerable<int>, we invoke ToList() at the end.
Using Union()
The Union() method creates an union between two sequences, using the default equality comparer. The union of two lists contains the elements from both, excluding duplicates:
listWithDuplicates.Union(listWithDuplicates).ToList();
Here, we perform the union between listWithDuplicates and itself. This operation produces a IEnumerable<T> without duplicates and we convert into a List<T> with ToList().
Remove Duplicates From a List in C# by Using Other Collections
HashSet
HashSet<T> represents a unordered collection that contains no duplicate elements:
listWithDuplicates.ToHashSet().ToList();
In this case, we convert the list into a HashSet<T> and this operation returns an IEnumerable<T> without duplicates and then, we convert it into a List<T>.
Another interesting use of HashSet<T> is to initialize a Hashet from listWithDuplicates and then converting back to a list:
new HashSet<T>(listWithDuplicates).ToList();
In this case, when we initialize the HashSet, the compiler deletes the multiple values.
Dictionary
A Dictionary<TKey,TValue> maps a set of keys to a set of values:
var dict = new Dictionary<T, int>();
foreach (var s in listWithDuplicates)
{
dict.TryAdd(s, 1);
}
var distinctList = dict.Keys.ToList();
In the foreach loop, we call TryAdd(), that adds an item to the keys of the dictionary only if it is not already contained, since a key in a dictionary must be unique. When we exit from the loop, we take the keys and convert the collection to a list.
Empty List
We can also add the unique elements to another list:
var listWithoutDuplicates = new List<T>();
foreach (T item in listWithDuplicates)
{
if (!listWithoutDuplicates.Contains(item)) //we can also use !listWithoutDuplicates.Any(x => x.Equals(item))
{
listWithoutDuplicates.Add(item);
}
}
We check if listWithoutDuplicates already contains an item in the foreach loop, and if it doesn’t we add it. To verify it, we use Contains() LINQ method, but we can also use Any() and Equal() comparer and we will see if this difference impact performance in the related section.
Iterating Over The List
We can remove duplicate elements using iterations:
var n = listWithDuplicates.Count;
for (int i = 0; i < n; i++)
{
for (int j = i + 1; j < n; j++)
{
if (listWithDuplicates.ElementAt(i).Equals(listWithDuplicates.ElementAt(j)))
{
for (int k = j; k < n - 1; k++)
{
T item = listWithDuplicates.ElementAt(k);
item = listWithDuplicates.ElementAt(k + 1);
}
j--;
n--;
}
}
}
var distinctList = listWithDuplicates.Take(n).ToList();
Starting from index i = 0, we evaluate all the items starting from the index j = i + 1. If we find that i-th element is equal to the j-th element, then we shift every item to the left starting from j to n = listWithDuplicates.Count - 1.
Every time we find a duplicate, the size of the final list decreases by one. When we get out of the loops we will take the first ndistinct items contained in listWithDuplicates with Take() method.
Iterations With Swapping
As already suggested in the article for the removal of duplicates from the array, the iterative method can be improved with the swapping of elements in the loop:
var size = listWithDuplicates.Count;
for (int i = 0; i < size; i++)
{
for (int j = i + 1; j < size; j++)
{
if (listWithDuplicates.ElementAt(i)!.Equals(listWithDuplicates.ElementAt(j)))
{
size--;
T jThItem = listWithDuplicates.ElementAt(j);
jThItem = listWithDuplicates.ElementAt(size);
j--;
}
}
}
listWithDuplicates.Take(size).ToList();
We swap two nearest items if they are equal, and we decrease by one the size of the list every time this happens. When we exit from the loop, we take the first size items of listWithDuplicates. While similar to the previous approach, there are some performance implications that we’re going to explore at the end.
Recursive Approach
The recursion provides another considerable way to perform the removal of duplicates:
public List<T> UsingRecursion(List<T>? listWithDuplicates, List<T>? listWithoutDuplicates = default, int index = 0)
{
if (listWithoutDuplicates == null)
{
listWithoutDuplicates = new List<T>();
}
if (index >= listWithDuplicates.Count)
{
return listWithDuplicates;
}
if (listWithoutDuplicates.IndexOf(listWithDuplicates[index]) < 0)
{
listWithoutDuplicates.Add(listWithDuplicates[index]);
}
UsingRecursion(listWithDuplicates, listWithoutDuplicates, index + 1);
return listWithoutDuplicates.ToList();
}
In this case, we initialize listWithoutDuplicates as an empty list if it is null, and we check if the list contains an item that occupies the position given by index using the IndexOf() method.
If listWithoutDuplicates doesn’t contain the element, we add it to another list and we call again the UsingRecursion() method passing index incremented by one. The recursion stops when the passed value of index is greater than or equal to the size of the list.
Sorting
If we sort the list, we can compare elements two by two:
var listWithoutDuplicates = new List<T>();
listWithDuplicates = listWithDuplicates.OrderBy(x => x).ToList();
T element = default;
foreach (T result in listWithDuplicates)
{
if (!result.Equals(element))
{
listWithoutDuplicates.Add(result);
element = result;
}
}
We start by defining a new empty listWithoutDuplicates, sorting listWithDuplicates in ascending order, and defining the variable element with the default value. Then we loop over the items and compare everyone with the element adding it to listWithoutDuplicates only if it doesn’t contain it yet. After the eventual addition, we assign to element the current value in the loop.
We want to show the differences in terms of performance between the methods explored so far, using BenchmarkDotNet:
| Method | Mean | Error | StdDev | Rank | Gen 0 | Allocated |
|---------------------------- |-----------:|----------:|----------:|-----:|-------:|----------:|
| EmptyListWithContainsMethod | 1.213 us | 0.0115 us | 0.0107 us | 1 | 0.0172 | 72 B |
| DictionaryMethod | 1.338 us | 0.0213 us | 0.0189 us | 2 | 0.0687 | 288 B |
| ConvertToHashSetMethod | 2.409 us | 0.0482 us | 0.0610 us | 3 | 0.9918 | 4,160 B |
| DistinctLINQMethod | 2.424 us | 0.0431 us | 0.0382 us | 3 | 1.0071 | 4,224 B |
| InitializingHashSetMethod | 2.446 us | 0.0477 us | 0.0823 us | 3 | 0.9918 | 4,160 B |
| UnionLINQMethod | 3.981 us | 0.0401 us | 0.0313 us | 4 | 0.0916 | 392 B |
| GroupByLINQMethod | 4.653 us | 0.0428 us | 0.0526 us | 5 | 0.8698 | 3,664 B |
| RecursiveMethod | 7.302 us | 0.1636 us | 0.4507 us | 6 | 3.4561 | 14,472 B |
| SortMethod | 7.756 us | 0.1248 us | 0.1622 us | 7 | 1.9989 | 8,376 B |
| IterationsAndSwappingMethod | 9.103 us | 0.1149 us | 0.1018 us | 8 | 1.1597 | 4,888 B |
| EmptyListWithAnyMethod | 11.466 us | 0.2288 us | 0.5022 us | 9 | 8.5297 | 35,680 B |
| IterationsAndShiftingMethod | 389.509 us | 5.3991 us | 6.2176 us | 10 | 0.9766 | 4,888 B |
The results show how the method which uses EmptyList with Contains() has better performance in terms of the time of execution and it is also the one that needs less allocated memory.
It’s interesting to outline the difference between two iterative methods, that differ for the technique adopted to remove duplicates, despite both performing at least two loops. In fact, IterationsAndSwappingMethod() has a sharply shorter mean time of execution than IterationsAndShiftingMethod(), even if they both have the same memory allocation needed.
Moreover, the conversion of the list into a HashSet has a slightly better mean execution time than the initialization of a new HashSet from the list, even if they are equal in terms of the allocated memory.
Conclusions
Permalink note/prog/csharp_241206/001/index.txt · Last modified: 2024/12/06 10:56 by
jethro