Dot Net Perls

Word Count Regex

by Sam Allen

Problem

You have a string that contains English words and you need to count the words in it similarly to how Microsoft Word counts them, since Word's count is a common reference point. Words are separated by whitespace, or by the start or the end of the string. Punctuation attached to a word, as in "be," below, counts as part of that word. Here are some required results.

When run on this string                      You want to get this number
To be or not to be, that is the question.    10
Mary had a little lamb.                      5

C# Solution

Here I show two methods. Each yields results fairly close to the word count in Microsoft Word, and each is useful in different scenarios. First I show the Regex word count version, and then the loop word count version.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// This is original code that implements two word-count algorithms
/// that yield very similar results to Microsoft Word.
/// </summary>
static class WordCount
{
    /// <summary>
    /// Count the number of words in the string using regular expression.
    /// This method is more ACCURATE and SIMPLER.
    /// </summary>
    /// <param name="textIn">The string you want a word count of.</param>
    /// <returns>The number of words in the string.</returns>
    static public int CountRegex(string textIn)
    {
        // Each run of one or more non-whitespace characters is one word.
        MatchCollection collection = Regex.Matches(textIn, @"\S+");
        return collection.Count;
    }

    /// <summary>
    /// Count words (using spaces and character analysis).
    /// </summary>
    /// <param name="str">The string to count the words of.</param>
    /// <returns>The number of words in the document.</returns>
    static public int Count(string str)
    {
        int c = 0;
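        // Count each word that starts right after a whitespace character.
        for (int i = 1; i < str.Length; i++)
        {
            if (char.IsWhiteSpace(str[i - 1]) &&
                (char.IsLetterOrDigit(str[i]) || char.IsPunctuation(str[i])))
            {
                c++;
            }
        }
        // Count the first word, which has no whitespace before it. Skip this
        // when the string is empty or starts with whitespace.
        if (str.Length > 0 && !char.IsWhiteSpace(str[0]))
        {
            c++;
        }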
        return c;
    }
}
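
As a quick check against the required results, here is a minimal sketch that applies the same regular expression to the two sample strings. The class name WordCountUsage and the helper CountWords are hypothetical names used only for this illustration. The helper simply counts runs of non-whitespace characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordCountUsage
{
    // Hypothetical helper: counts runs of non-whitespace characters.
    public static int CountWords(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    static void Main()
    {
        // The two sample strings from the Problem section.
        Console.WriteLine(CountWords("To be or not to be, that is the question.")); // 10
        Console.WriteLine(CountWords("Mary had a little lamb."));                   // 5
    }
}
```

Note that "be," is a single match here, so the comma does not start a new word.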

Microsoft Word Comparison

Microsoft Office is widely used in business, so I compared the results of these two algorithms against Microsoft Word 2007. Here are my results for five text documents.

Word counts compared with Microsoft Word 2007.

ID     Description                        Microsoft Word 2007   Old loop method   Regex method
1      Personal Journal                   4007                  3973              3990
2      EULA                               1414                  1399              1414
3      Wikipedia article                  462                   459               463
4      Medical report                     470                   465               470
5      University report (art history)    2742                  2710              2738
-      Total                              9095                  9006              9075
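
To turn the totals into percent differences from Microsoft Word, a small calculation is enough. The sketch below uses only the three totals from the table. The class name PercentDifference and the helper FromBaseline are hypothetical names introduced for this example.

```csharp
using System;
using System.Globalization;

static class PercentDifference
{
    // Hypothetical helper: signed percent difference of a count from a baseline.
    public static double FromBaseline(int count, int baseline)
    {
        return (count - baseline) * 100.0 / baseline;
    }

    static void Main()
    {
        // Totals taken from the comparison table.
        const int wordTotal = 9095;
        const int loopTotal = 9006;
        const int regexTotal = 9075;

        // Invariant culture keeps the decimal separator a period on any machine.
        CultureInfo inv = CultureInfo.InvariantCulture;
        Console.WriteLine(FromBaseline(loopTotal, wordTotal).ToString("F2", inv));  // -0.98
        Console.WriteLine(FromBaseline(regexTotal, wordTotal).ToString("F2", inv)); // -0.22
    }
}
```

On these five documents the loop method comes in about 1 percent under Word in total, and the Regex method about 0.2 percent under.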

Discussion

The regular expression method here is better: in the comparison above it matched Word's count, or came closer to it than the loop method, for every document. Unless you have a strict performance requirement, the Regex method is easier to maintain and more accurate. It is probably fast enough on large blocks of text, though no timings are given here. Both methods should work for languages that separate words with spaces, including a variety of Romance languages.
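
One case where the two approaches disagree is a token that begins with a symbol, such as a currency sign. The sketch below is a simplified illustration, not the exact listing above: the class WordCountDemo and both helpers are hypothetical, and the loop treats the start of the string like a whitespace boundary. Whether this accounts for the gaps in the comparison table is an assumption, since the test documents are not shown.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordCountDemo
{
    // Regex-style count: every run of non-whitespace characters is a word.
    public static int CountRegex(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }

    // Loop-style count (simplified): a word starts at a letter, digit, or
    // punctuation mark that follows whitespace or the start of the string.
    public static int CountLoop(string text)
    {
        int c = 0;
        for (int i = 0; i < text.Length; i++)
        {
            bool afterBreak = i == 0 || char.IsWhiteSpace(text[i - 1]);
            if (afterBreak &&
                (char.IsLetterOrDigit(text[i]) || char.IsPunctuation(text[i])))
            {
                c++;
            }
        }
        return c;
    }

    static void Main()
    {
        // '$' is a currency symbol, so char.IsPunctuation returns false for it.
        string sample = "It costs $5 today";
        Console.WriteLine(CountRegex(sample)); // 4
        Console.WriteLine(CountLoop(sample));  // 3
    }
}
```

The loop skips "$5" because its first character is neither a letter, a digit, nor punctuation, while the regular expression counts any non-whitespace run.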

© 2007-2008 Sam Allen. All rights reserved.
