You have a string of English text and need to count its words much as Microsoft Word does, since many people treat Word's figure as the reference count. In the methods below, a word is a run of characters delimited by whitespace or by the start or end of the string; punctuation attached to a word counts as part of it. Here are some required results.
| Input string | Expected word count |
| To be or not to be, that is the question. | 10 |
| Mary had a little lamb. | 5 |
Here I show two methods. Both yield results close to the word count in Microsoft Word, and each is useful in different scenarios. First comes the Regex word count version, then the loop word count version.
using System;
using System.Text;
using System.Text.RegularExpressions;
/// <summary>
/// This is original code that implements two word-count algorithms
/// that yield very similar results to Microsoft Word.
/// </summary>
static class WordCount
{
    /// <summary>
    /// Count the number of words in the string using a regular expression.
    /// This method is more ACCURATE and SIMPLER.
    /// </summary>
    /// <param name="textIn">The string you want a word count of.</param>
    /// <returns>The number of words in the string.</returns>
    public static int CountRegex(string textIn)
    {
        // Each run of one or more non-whitespace characters is one word.
        MatchCollection collection = Regex.Matches(textIn, @"\S+");
        return collection.Count;
    }

    /// <summary>
    /// Count words (using spaces and character analysis).
    /// </summary>
    /// <param name="str">The string to count the words of.</param>
    /// <returns>The number of words in the document.</returns>
    public static int Count(string str)
    {
        int c = 0;
        // A word starts wherever whitespace is followed by a letter,
        // digit, or punctuation mark.
        for (int i = 1; i < str.Length; i++)
        {
            if (char.IsWhiteSpace(str[i - 1]) &&
                (char.IsLetterOrDigit(str[i]) || char.IsPunctuation(str[i])))
            {
                c++;
            }
        }
        // The loop starts at index 1, so count the first word separately,
        // but only when the string does not begin with whitespace.
        if (str.Length > 0 && !char.IsWhiteSpace(str[0]))
        {
            c++;
        }
        return c;
    }
}
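As a quick check against the required results listed earlier, here is a minimal sketch that runs the regular expression approach on the two sample strings. The class name WordCountDemo and its Main method are made up for this illustration; the counting logic is the same idea as CountRegex above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original listing.
static class WordCountDemo
{
    // Same idea as CountRegex: each run of non-whitespace characters is one word.
    public static int CountRegex(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    static void Main()
    {
        string[] samples =
        {
            "To be or not to be, that is the question.",
            "Mary had a little lamb."
        };

        // Print each sample string next to its word count.
        foreach (string sample in samples)
        {
            Console.WriteLine("{0} -> {1}", sample, CountRegex(sample));
        }
    }
}
```

The two counts printed should be 10 and 5, matching the table of required results. Note that "be," is a single match, because the comma is not whitespace.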
Microsoft Office dominates the business world, so I compared the results of these two algorithms against Microsoft Word 2007. Here are my results for five text documents.
| Document ID | Description | Microsoft Word 2007 | Old loop method | Regex method |
| 1 | Personal Journal | 4007 | 3973 | 3990 |
| 2 | EULA | 1414 | 1399 | 1414 |
| 3 | Wikipedia article | 462 | 459 | 463 |
| 4 | Medical report | 470 | 465 | 470 |
| 5 | University report (art history) | 2742 | 2710 | 2738 |
| - | Total | 9095 | 9006 | 9075 |
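To put the totals in perspective, the deviation from Microsoft Word 2007 can be expressed as a percentage of Word's count. The sketch below does that arithmetic on the Total row; the class and method names (WordCountAccuracy, PercentError) are invented for this example.

```csharp
using System;

// Hypothetical helper for comparing totals; not part of the original listing.
static class WordCountAccuracy
{
    // Absolute difference from the reference count, as a percentage of the reference.
    public static double PercentError(int reference, int measured)
    {
        return Math.Abs(reference - measured) * 100.0 / reference;
    }

    static void Main()
    {
        // Totals copied from the Total row of the table above.
        const int word2007 = 9095;
        const int loopMethod = 9006;
        const int regexMethod = 9075;

        Console.WriteLine("Loop method:  {0:F2}% off", PercentError(word2007, loopMethod));
        Console.WriteLine("Regex method: {0:F2}% off", PercentError(word2007, regexMethod));
    }
}
```

On these totals the loop method is off by 89 words (just under 1%) and the Regex method by 20 words (roughly 0.2%).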
The regular expression method is better here. Across the five documents its total is within 20 words of Microsoft Word's, against 89 for the loop method, and it matches Word exactly on documents 2 and 4. Unless you have a strict performance requirement, the Regex method is also easier to maintain. It is probably still fast on large blocks of text. Both methods should work with a variety of Romance languages.
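If you do have a strict performance requirement, one option is a single pass over the characters that counts runs of non-whitespace, which is the same definition of a word that the pattern \S+ uses. This is a sketch of that alternative, not one of the two methods measured above; the names FastWordCount and CountRuns are hypothetical, and it assumes that char.IsWhiteSpace and the regex class \s agree on what whitespace is, which should hold for ordinary text.

```csharp
using System;

// Hypothetical alternative; not one of the two methods measured in this article.
static class FastWordCount
{
    // Counts runs of non-whitespace characters in a single pass, without Regex.
    public static int CountRuns(string text)
    {
        int count = 0;
        bool inWord = false;

        foreach (char ch in text)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace (or at the
                // start of the string) begins a new word.
                inWord = true;
                count++;
            }
        }

        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountRuns("To be or not to be, that is the question."));
        Console.WriteLine(CountRuns("  leading and trailing spaces  "));
    }
}
```

Because it tracks whether it is currently inside a word, this version handles leading whitespace and very short strings without special cases. Whether it is actually faster than the Regex method on your data is something to measure rather than assume.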