Word count. Strings in C# programs often contain multiple words. These words can be counted in a method. The total should be close to what common tools, such as Microsoft Word, report.
Words are broken by punctuation, a space, or by being at the start or end of a string. We must detect these separators in C# code.
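Example code. Here we see 2 word-counting methods. CountWords1 uses a Regex that matches runs of non-whitespace characters, and CountWords2 uses a loop with character tests. The example program first runs the Regex version, and then the loop-based one.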
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "Cat, bird and dog.";
        int res1 = WordCounting.CountWords1(t1);
        int res2 = WordCounting.CountWords2(t1);
        Console.WriteLine(res1);
        Console.WriteLine(res2);
    }
}

/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
    /// <summary>
    /// Count words with Regex.
    /// </summary>
    public static int CountWords1(string s)
    {
        // Each run of non-whitespace characters is one match.
        MatchCollection collection = Regex.Matches(s, @"[\S]+");
        return collection.Count;
    }

    /// <summary>
    /// Count words with loop and character tests.
    /// </summary>
    public static int CountWords2(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            // A word starts where whitespace is followed by a letter, digit, or punctuation.
            if (char.IsWhiteSpace(s[i - 1]))
            {
                if (char.IsLetterOrDigit(s[i]) ||
                    char.IsPunctuation(s[i]))
                {
                    c++;
                }
            }
        }
        // Count the first word, which has no whitespace before it.
        // This assumes any string longer than 2 characters starts with a word.
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }
}

4
4
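The loop method above assumes that the string starts with a word and is longer than 2 characters. As a sketch only, a loop can instead track whether it is currently inside a word, which handles leading whitespace and very short strings. This is not the article's method: `WordCountSketch` and `CountTokens` are hypothetical names, and because the sketch counts runs of non-whitespace characters, its results should track the Regex method rather than the loop method's figures for typical text.

```csharp
using System;

public static class WordCountSketch
{
    // Hypothetical helper (not from the article): counts runs of
    // non-whitespace characters in a single pass over the string.
    public static int CountTokens(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after a gap starts a new word.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}
```

Calling `WordCountSketch.CountTokens("Cat, bird and dog.")` returns 4, the same result as both methods in the example, while an empty string or a string of only spaces returns 0.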
Accuracy. Here I provide some statistics about these 2 algorithms versus Microsoft Word. Across the 5 documents below, the Regex method's total differs from Microsoft Word's by about 0.2%.
Document A
Microsoft Word: 4007 words
Regex method: 3990 words [closest]
Loop method: 3973 words
Document B
Microsoft Word: 1414 words
Regex method: 1414 words [closest]
Loop method: 1399 words
Document C
Microsoft Word: 462 words
Regex method: 463 words [closest]
Loop method: 459 words
Document D
Microsoft Word: 470 words
Regex method: 470 words [closest]
Loop method: 465 words
Document E
Microsoft Word: 2742 words
Regex method: 2738 words [closest]
Loop method: 2710 words

Input:      To be or not to be, that is the question.
            Mary had a little lamb.
Word count: 10
            5
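The relative difference between each method and Microsoft Word can be recomputed from the document counts listed above. The following is a minimal sketch, assuming the comparison is made on the totals across all 5 documents (so an overcount in one document can offset an undercount in another). The class and member names are hypothetical; only the numbers come from the article.

```csharp
using System;

public static class AccuracyCheck
{
    // Word counts for Documents A through E, copied from the article.
    public static readonly int[] WordCounts  = { 4007, 1414, 462, 470, 2742 };
    public static readonly int[] RegexCounts = { 3990, 1414, 463, 470, 2738 };
    public static readonly int[] LoopCounts  = { 3973, 1399, 459, 465, 2710 };

    // Returns the absolute difference between the two totals,
    // as a percentage of the reference total.
    public static double PercentDifference(int[] reference, int[] measured)
    {
        int referenceTotal = 0;
        int measuredTotal = 0;
        for (int i = 0; i < reference.Length; i++)
        {
            referenceTotal += reference[i];
            measuredTotal += measured[i];
        }
        return 100.0 * Math.Abs(referenceTotal - measuredTotal) / referenceTotal;
    }
}
```

Passing `WordCounts` with `RegexCounts`, and then with `LoopCounts`, gives one percentage per method, which makes it easy to re-check the comparison if more documents are added.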
Example 2. Here we specify that a certain character (like "#") is also a word separator. We use character ranges to specify valid word characters.
Note: If you omit a character from the ranges, that character is considered a word separator.
Here: You can see that with this version of the Regex, the substring "is#the#question" is treated as 3 separate words.
Tip: This is because the pound sign is not included in the ranges of valid characters in the pattern.
And: With this form of the Regex pattern, you can more easily change which characters are valid and which are not.
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is#the#question.";
        Console.WriteLine(CountWordsModified(t1));
    }

    static int CountWordsModified(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }
}

10
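To illustrate how changing the ranges changes the count, here is a small sketch that adds the apostrophe to the valid characters so that contractions stay in one piece. The class and method names are hypothetical, and whether an apostrophe should count as a word character is an assumption that depends on the text being counted.

```csharp
using System.Text.RegularExpressions;

public static class RangeWordCounter
{
    // Hypothetical variation: the apostrophe is added to the valid characters.
    public static int CountWithApostrophes(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9']+").Count;
    }

    // Same ranges as the example above, for comparison.
    public static int CountWithoutApostrophes(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }
}
```

For the input "Don't stop now." the first method returns 3, while the second returns 4 because the apostrophe splits "Don't" into "Don" and "t".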
Performance. Testing each character in a loop would likely be faster. It makes a single pass over the string and is nearly optimal, while the Regex-based method would draw in far more computation.
Tip: You can store the Regex object in a field of the class (for example, a static readonly field) so it is created only once.
Then: You can call its instance Matches method instead of the static Regex.Matches method. This can improve speed.
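The tip can be sketched as follows. This is a minimal example, assuming a static readonly field and the same non-whitespace pattern as the first example. The name `CachedWordCounter` is hypothetical, and `RegexOptions.Compiled` is an optional choice that trades startup time for faster matching. Any actual speedup should be measured, since the static Regex methods also cache recently used patterns internally.

```csharp
using System.Text.RegularExpressions;

public static class CachedWordCounter
{
    // The Regex is constructed once and reused for every call.
    // RegexOptions.Compiled is an assumption here, not something the article specifies.
    private static readonly Regex WordPattern =
        new Regex(@"\S+", RegexOptions.Compiled);

    public static int CountWords(string s)
    {
        // Instance Matches method, instead of the static Regex.Matches.
        return WordPattern.Matches(s).Count;
    }
}
```

Calling `CachedWordCounter.CountWords("Cat, bird and dog.")` returns 4, the same result as the static version in the first example.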
A summary. We saw 2 word count methods, both of which provide results similar to Microsoft Word. The Regex-using one is closer to Microsoft Word's results.
Dot Net Perls is a collection of tested code examples. Pages are continually updated to stay current, with code correctness a top priority.
Sam Allen is passionate about computer languages. In the past, his work has been recommended by Apple and Microsoft and he has studied computers at a selective university in the United States.