Strings in C# often contain duplicate words, and these duplicates are rarely useful. We can remove them.
This is similar to removing stop words (common words that carry little meaning). A lookup table like Dictionary can be used in a loop to track the words already seen.
Consider a string like "yellow bird blue bird." We want our algorithm to figure out that the word "bird" is repeated, and to remove the second occurrence.
Input:  yellow bird blue bird
Output: yellow bird blue
We use a Dictionary for constant-time lookup. We process the words in a loop, and we need to check each word against all words already encountered.
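The check against words already encountered can be sketched in isolation as follows. This is a minimal illustration, not part of the original program: the class name SeenWordsDemo and the method FindDuplicates are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

static class SeenWordsDemo
{
    // Hypothetical helper: returns each word that was already seen earlier in the array.
    public static List<string> FindDuplicates(string[] words)
    {
        // Only the keys matter here; the bool value is a placeholder.
        var seen = new Dictionary<string, bool>();
        var duplicates = new List<string>();

        foreach (string word in words)
        {
            // ContainsKey is a hash lookup, so its cost does not grow
            // with the number of words stored so far.
            if (seen.ContainsKey(word))
            {
                duplicates.Add(word);
            }
            else
            {
                seen.Add(word, true);
            }
        }
        return duplicates;
    }

    static void Main()
    {
        string[] words = { "yellow", "bird", "blue", "bird" };
        foreach (string duplicate in FindDuplicates(words))
        {
            Console.WriteLine(duplicate);
        }
        // Prints: bird
    }
}
```

Without the lookup table, each word would have to be compared against every earlier word, which becomes slow as the input grows.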
We build the result with a StringBuilder for performance. The Dictionary stores words already encountered.
By passing a char array of delimiters to string Split, we can deal with punctuation.
The var keyword declares the Dictionary. The compiler infers the type, which simplifies the syntax of the program.

    using System;
    using System.Collections.Generic;
    using System.Text;

    class Program
    {
        static void Main()
        {
            string s = "yellow bird, blue bird, yellow sun";
            Console.WriteLine(s);
            Console.WriteLine(RemoveDuplicateWords(s));
        }

        public static string RemoveDuplicateWords(string v)
        {
            // Keep track of words found in this Dictionary.
            var d = new Dictionary<string, bool>();

            // Build up the result in this StringBuilder.
            StringBuilder b = new StringBuilder();

            // Split the input on spaces and punctuation, skipping empty entries.
            string[] a = v.Split(new char[] { ' ', ',', ';', '.' },
                StringSplitOptions.RemoveEmptyEntries);

            // Loop over each word.
            foreach (string current in a)
            {
                // Lowercase each word so the comparison ignores case.
                string lower = current.ToLowerInvariant();

                // If we haven't already encountered the word,
                // append it to the result.
                if (!d.ContainsKey(lower))
                {
                    b.Append(current).Append(' ');
                    d.Add(lower, true);
                }
            }

            // Return the result without the trailing space.
            return b.ToString().Trim();
        }
    }

Output:

    yellow bird, blue bird, yellow sun
    yellow bird blue sun
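Since the bool values in the Dictionary are never read, the same idea can also be written with a HashSet. The sketch below is an alternative, not the method shown above: the names DuplicateWords and RemoveWithHashSet are hypothetical, and it assumes that ordinal case-insensitive comparison is an acceptable replacement for lowercasing each word.

```csharp
using System;
using System.Collections.Generic;

static class DuplicateWords
{
    static readonly char[] Separators = { ' ', ',', ';', '.' };

    // Hypothetical variant of RemoveDuplicateWords built on HashSet.
    public static string RemoveWithHashSet(string input)
    {
        // The comparer makes "Bird" and "bird" count as the same word.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var kept = new List<string>();

        foreach (string word in input.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Add returns false when the word is already in the set,
            // so one call both tests for and records the word.
            if (seen.Add(word))
            {
                kept.Add(word);
            }
        }

        // Join inserts separators only between words, so no Trim is needed.
        return string.Join(" ", kept);
    }

    static void Main()
    {
        Console.WriteLine(RemoveWithHashSet("yellow bird, blue bird, yellow sun"));
        // Prints: yellow bird blue sun
    }
}
```

This version avoids allocating a lowercase copy of every word, at the cost of moving away from the Dictionary and StringBuilder pairing this article focuses on.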
I used this code, and a variant that removes stop words, to implement a full-text search feature in a Windows Forms program. A dedicated full-text search database is also useful for this kind of feature.
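A stop-word variant might look like the sketch below. It is not the author's code: the class and method names are hypothetical, the stop list is a tiny made-up sample, and a HashSet is used as the lookup table in place of a Dictionary.

```csharp
using System;
using System.Collections.Generic;

static class StopWords
{
    // Hypothetical stop list for illustration; a real list would be much longer.
    static readonly HashSet<string> Stops =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "of", "the" };

    public static string RemoveStopWords(string input)
    {
        var kept = new List<string>();
        string[] words = input.Split(new[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Keep only the words that are not in the stop list.
            if (!Stops.Contains(word))
            {
                kept.Add(word);
            }
        }
        return string.Join(" ", kept);
    }

    static void Main()
    {
        Console.WriteLine(RemoveStopWords("The yellow bird and the blue sun"));
        // Prints: yellow bird blue sun
    }
}
```

The structure is the same as for duplicate removal. The difference is that the lookup table is filled in advance instead of being built up during the loop.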
We combined Dictionary with StringBuilder to develop a method that removes duplicate English words efficiently. The code does one lookup for each word as it encounters it.