Delete duplicate words (such as English words) from strings efficiently and simply. Input strings often include repetitive natural language, and the duplicate words are frequently of no use to the program's operation. Processing them can slow your application down. Here are some required results.
| Input string | Required result |
| --- | --- |
| Do or do not; there is no try. | Do or not there is no try |
| Dot Net Perls is a web site about the Dot Net Framework. | Dot Net Perls is a web site about the Framework |
The following method is static, and lives in a static class, because it does not need to save state. It removes words that have already been encountered, ignoring differences between uppercase and lowercase. It handles some punctuation: spaces, commas, semicolons, and periods are treated as word separators and do not appear in the result.
using System;
using System.Collections.Generic;
using System.Text;

/// <summary>
/// Takes a C# string and removes duplicate words (words occurring more than once).
/// </summary>
/// <param name="inputValue">The string to remove duplicate words from.</param>
/// <returns>A new string with only the first occurrence of each word.</returns>
public static string RemoveDuplicateWords(string inputValue)
{
    // Keys are the lowercased words seen so far; the bool values are unused.
    var wordsFound = new Dictionary<string, bool>();
    StringBuilder builder = new StringBuilder();

    // Split the input on spaces and punctuation, dropping empty entries.
    string[] inputWords = inputValue.Split(new char[] { ' ', ',', ';', '.' },
        StringSplitOptions.RemoveEmptyEntries);

    foreach (string currentWord in inputWords)
    {
        string lowerWord = currentWord.ToLower();

        // Add this word to the result only if it hasn't already occurred.
        if (!wordsFound.ContainsKey(lowerWord))
        {
            builder.Append(currentWord).Append(' ');
            wordsFound.Add(lowerWord, true);
        }
    }
    return builder.ToString().Trim(); // Remove the trailing space.
}
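To make the Split step easier to see on its own, here is a minimal sketch that isolates it. The class name `SplitSketch` and the helper `SplitWords` are hypothetical names introduced for illustration; only the separator set and the `StringSplitOptions.RemoveEmptyEntries` option come from the method above.

```csharp
using System;

static class SplitSketch
{
    // Same separator set as the method above: space, comma, semicolon, period.
    static readonly char[] Separators = { ' ', ',', ';', '.' };

    // Hypothetical helper (not part of the original code) that isolates the Split call.
    public static string[] SplitWords(string input)
    {
        return input.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // "be," is followed by a space, so two separators sit side by side.
        string[] words = SplitWords("to be, or not; to be.");
        Console.WriteLine(words.Length);             // 6
        Console.WriteLine(string.Join("|", words));  // to|be|or|not|to|be

        // Without RemoveEmptyEntries, the empty string between
        // the adjacent comma and space is kept in the result.
        string[] raw = "to be, or".Split(Separators);
        Console.WriteLine(raw.Length);               // 4
    }
}
```

Characters outside the separator set stay attached to their word, so with this set "try!" and "try" would count as two different words. If your input contains other punctuation, you would probably want to extend the array.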
Here is how you can call the RemoveDuplicateWords method. Simply assign the result of the method to a string variable. The method does not modify the input string in place (strings in C# are immutable); it builds and returns a new string.
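// Assumes this method is in the same static class as RemoveDuplicateWords.
private static void Example()
{
    string test = "To be or not to be, that is the question. " +
        "Except when I ask if he is there or, you.";
    string simple = RemoveDuplicateWords(test);
    Console.WriteLine(simple);
    // Prints: To be or not that is the question Except when I ask if he there you
    // Only the first instance of "to", "be", "or" and "is" is kept,
    // with the letter case of that first instance preserved.
}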
Download this code at my open-source code archive. The key parts are the separator array and options passed to the Split method, and the general technique of using a Dictionary to test for duplication. The code has no significant performance problems: each Dictionary lookup takes constant time on average, so the running time grows roughly linearly with the number of words.
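The Dictionary here is only used to answer "have I seen this word?", so a `HashSet<string>` can play the same role. The following is a sketch of that variant, not the original code: the names `DuplicateWordsSketch` and `RemoveDuplicateWordsWithSet` are made up for illustration, and it swaps the `ToLower` call for a case-insensitive comparer (`StringComparer.OrdinalIgnoreCase`), which may treat some non-English letters differently than `ToLower` does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class DuplicateWordsSketch
{
    // Hypothetical variant of RemoveDuplicateWords that uses a HashSet.
    public static string RemoveDuplicateWordsWithSet(string inputValue)
    {
        // Case-insensitive set, so no lowercase copy of each word is needed.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var builder = new StringBuilder();

        // Same separators as the original method.
        string[] words = inputValue.Split(new char[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Add returns false when the word is already in the set.
            if (seen.Add(word))
            {
                // Put a space between words, but not before the first one.
                if (builder.Length > 0)
                {
                    builder.Append(' ');
                }
                builder.Append(word);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveDuplicateWordsWithSet("Do or do not; there is no try."));
        // Do or not there is no try
    }
}
```

Because `Add` both tests for membership and inserts, the separate ContainsKey and Add calls collapse into one. Whether that matters in practice depends on your input sizes; for short strings either version should be fine.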