Your string contains English words, and you need to count them. The total should be close to the count reported by Microsoft Word 2007. Words are delimited by whitespace or by the start or end of the string; punctuation attached to a word counts as part of that word. This page presents two word counting implementations in the C# programming language.
=== Accuracy of word counting methods ===
Document A
Microsoft Word: 4007 words
Regex method: 3990 words [closest]
Loop method: 3973 words
Document B
Microsoft Word: 1414 words
Regex method: 1414 words [closest]
Loop method: 1399 words
Document C
Microsoft Word: 462 words
Regex method: 463 words [closest]
Loop method: 459 words
Document D
Microsoft Word: 470 words
Regex method: 470 words [closest]
Loop method: 465 words
Document E
Microsoft Word: 2742 words
Regex method: 2738 words [closest]
Loop method: 2710 words
=== Example input and output ===
Input 1: To be or not to be, that is the question.
Word count: 10
Input 2: Mary had a little lamb.
Word count: 5
The program below contains two word counting methods, both of which yield results close to those of Microsoft Word 2007. For each input string, it runs the Regex-based method first and then the loop-based one.
~~~ Program that counts words (C#) ~~~
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is the question.";
        Console.WriteLine(WordCounting.CountWords1(t1));
        Console.WriteLine(WordCounting.CountWords2(t1));

        const string t2 = "Mary had a little lamb.";
        Console.WriteLine(WordCounting.CountWords1(t2));
        Console.WriteLine(WordCounting.CountWords2(t2));
    }
}

/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
    /// <summary>
    /// Count words with Regex.
    /// </summary>
    public static int CountWords1(string s)
    {
        // Each run of non-whitespace characters counts as one word.
        MatchCollection collection = Regex.Matches(s, @"\S+");
        return collection.Count;
    }

    /// <summary>
    /// Count words with a loop and character tests.
    /// </summary>
    public static int CountWords2(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            // A word starts where whitespace is followed by a letter,
            // digit, or punctuation character.
            if (char.IsWhiteSpace(s[i - 1]) &&
                (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i])))
            {
                c++;
            }
        }
        // Count the first word, which has no whitespace before it.
        // Note: strings of one or two characters, such as "Hi", return 0.
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }
}
~~~ Output of the program ~~~
10
10
5
5
The code uses static methods. This logic belongs in static methods because it maintains no state or data: you can think of it as an action, not an object. Each method receives a string and returns an integer equal to the number of words it counts.
The first method uses Regex. The first method here, CountWords1, is better in every way except perhaps performance. It is shorter and simpler to maintain, and it came closest to Microsoft Word's count in all five test documents. The \S metacharacter matches any character that is not whitespace, and the + quantifier matches one or more of them in a row. So the first method treats every non-whitespace character, including punctuation and digits, as part of a word, similar to Microsoft Word.
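To see what the pattern treats as a word, it can help to list the matches rather than only count them. The following is a minimal sketch; the class WordTokens and the method GetWords are hypothetical names used for illustration and are not part of the original program.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordTokens
{
    // Hypothetical helper (an assumption, not from the original program):
    // returns each substring matched by \S+ so the "words" can be inspected.
    public static List<string> GetWords(string s)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(s, @"\S+"))
        {
            words.Add(match.Value);
        }
        return words;
    }
}
```

For the input "to be, that", GetWords returns three items: "to", "be," and "that". The comma stays attached to the word before it, which is why punctuation does not change the count.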
Microsoft Office dominates the business world, so I compared the results of these two algorithms against Microsoft Word 2007 on five text documents, A through E. The per-document results appear in the accuracy table at the top of this document.
=== Difference from MS Word ===
Regex method: 0.22%
Loop method: 0.98%
Calculations. Summed over the five documents, Microsoft Word counted 9095 words, the Regex method 9075, and the loop method 9006. The Regex method therefore differs from Microsoft Word by 20 words, or about 0.22%, and the loop method by 89 words, or about 0.98%. In the first chart, you can see the difference in absolute words for each document.
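The following is a minimal sketch of one way to compute such a percentage from the counts in the accuracy table. The class CountDifference and its members are hypothetical names, and the choice to sum the counts over all five documents before comparing is an assumption about how the figure is derived.

```csharp
using System;
using System.Linq;

public static class CountDifference
{
    // Counts copied from the accuracy table for documents A to E.
    public static readonly int[] Word = { 4007, 1414, 462, 470, 2742 };
    public static readonly int[] RegexMethod = { 3990, 1414, 463, 470, 2738 };
    public static readonly int[] LoopMethod = { 3973, 1399, 459, 465, 2710 };

    // Hypothetical helper (an assumption): percentage by which the summed
    // counts of a method differ from the summed reference counts.
    public static double PercentDifference(int[] reference, int[] counted)
    {
        int referenceTotal = reference.Sum();
        int countedTotal = counted.Sum();
        return 100.0 * Math.Abs(referenceTotal - countedTotal) / referenceTotal;
    }
}
```

A call such as CountDifference.PercentDifference(CountDifference.Word, CountDifference.RegexMethod) gives the figure for the Regex method. Summing first lets over-counts and under-counts cancel each other; averaging the per-document differences instead would give a different number.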
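The second method, which tests each character in a loop, would likely prove many times faster if carefully benchmarked. It makes a single pass over the string and allocates nothing, while the Regex-based method runs the regular expression engine and creates a Match object for every word. In .NET and other compiled languages, regular expressions are relatively slow compared with a hand-written loop. However, their greater ease of use and clarity is often far more important. In scripting languages, where the regular expression engine is native code and hand-written loops are interpreted, regular expressions often compare far better.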
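The speed difference can be checked rather than assumed. The following is a rough timing sketch using Stopwatch; the class WordCountTiming and the helper TimeMilliseconds are hypothetical names, and the two counting methods are copies of the logic shown earlier so that the block stands alone. It is not a rigorous benchmark, and the numbers will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class WordCountTiming
{
    // Same logic as the Regex method above.
    public static int CountRegex(string s)
    {
        return Regex.Matches(s, @"\S+").Count;
    }

    // Same logic as the loop method above.
    public static int CountLoop(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            if (char.IsWhiteSpace(s[i - 1]) &&
                (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i])))
            {
                c++;
            }
        }
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }

    // Hypothetical helper (an assumption): calls counter on text the given
    // number of times and returns the elapsed time in milliseconds.
    public static long TimeMilliseconds(Func<string, int> counter, string text, int iterations)
    {
        counter(text); // Warm-up call so JIT compilation is not timed.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter(text);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

For example, compare WordCountTiming.TimeMilliseconds(WordCountTiming.CountRegex, text, 100000) with the same call using WordCountTiming.CountLoop, where text is a document of realistic length.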
Optimization option. Store the Regex object in a field of the class. Because WordCounting is a static class, this must be a static field rather than an instance member. Then call the instance Matches method on that object instead of the static Regex.Matches method. My research has shown this could improve speed by about 2x.
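A minimal sketch of this optimization follows. The class name CachedWordCounting and the field name WordRegex are assumptions chosen for illustration; the speedup itself is not measured here.

```csharp
using System.Text.RegularExpressions;

public static class CachedWordCounting
{
    // Constructed once and reused on every call. The field is static
    // because the class is static; its name is an assumption.
    private static readonly Regex WordRegex = new Regex(@"\S+");

    public static int CountWords(string s)
    {
        // Instance Matches method on the stored Regex object,
        // instead of the static Regex.Matches method.
        return WordRegex.Matches(s).Count;
    }
}
```

The method returns the same counts as the Regex method above, so CachedWordCounting.CountWords can replace it without changing results.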
Here we saw two word count methods, both of which provide results similar to Microsoft Word 2007. The Regex-based method is considerably closer to Microsoft Word's results, although a small percentage difference remains. The algorithms here could be improved to offer even better compatibility with Microsoft Office. Ideally, an algorithm using the loop could yield equivalent results.
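As a starting point for such a loop, the following sketch counts runs of non-whitespace characters in a single pass, which is the same rule the \S+ pattern expresses. The names LoopWordCounting and CountWords3 are hypothetical. It assumes that char.IsWhiteSpace and the Regex whitespace class agree for the text being counted, and it has not been compared against Microsoft Word.

```csharp
public static class LoopWordCounting
{
    // Hypothetical third method (an assumption, not from the original
    // program): counts runs of non-whitespace characters in one pass.
    public static int CountWords3(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace
                // (or at the start of the string) begins a new word.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}
```

Unlike the loop method above, this version needs no special case for the first word, so short strings such as "Hi" count as one word and a string of only spaces counts as zero.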