Dot Net Perls

C# Sanitize String

by Sam Allen

Problem

Develop a method that removes punctuation and surplus whitespace from a string and returns a 'cleaned' (sanitized) version. The method must return a series of words separated by single spaces. This output is easy to feed into full-text indexes and other applications. The following table shows an example input and the expected output.

When you send this input            You need to receive
SomeText&--777 And where before.    SomeText 777 And where before

C# Solution

You will probably have other requirements, but this code shows one approach to the problem. The following static method receives a string and returns a new string stripped of unwanted characters. The comments in the code explain how it works.

using System.Text;

/// <summary>
/// Static class containing helper string methods.
/// </summary>
static public class StringUtil
{
    /// <summary>
    /// Keep the letters and digits of the input string.
    /// Replace each run of whitespace or punctuation characters with a
    /// single space, and drop any other character (such as symbols).
    /// </summary>
    /// <param name="selText">The string you want to process (sanitize).</param>
    /// <returns>The new sanitized version of the string.</returns>
    static public string SanitizeString(string selText)
    {
        // We build up a new string with the StringBuilder, and keep track of spaces
        // with a bool variable.
        StringBuilder res = new StringBuilder();
        bool lastWasSpace = false;

        for (int i = 0; i < selText.Length; i++)
        {
            if (char.IsLetterOrDigit(selText[i]))
            {
                res.Append(selText[i]);
                lastWasSpace = false;
            }
            else if (char.IsWhiteSpace(selText[i]) || char.IsPunctuation(selText[i]))
            {
                // Replace any number of whitespace or punctuation characters
                // in a sequence into a single space.
                if (lastWasSpace == false)
                {
                    res.Append(' ');
                    lastWasSpace = true;
                }
            }
        }
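        // Return the sanitized string, without the leading or trailing space
        // that punctuation at either end of the input would leave behind.
        return res.ToString().Trim();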
    }
}
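
One detail of the method above is worth knowing: characters that .NET classifies as symbols rather than punctuation, such as '+' or '$', match neither branch of the loop, so they are dropped without leaving a space ("a+b" becomes "ab"). If you would rather treat every character that is not a letter or digit as a word separator, a variation could look like the sketch below. The names SanitizeSketch and SanitizeStrict are hypothetical and are not part of the original code, and this is only one possible way to do it.

```csharp
using System;
using System.Text;

static class SanitizeSketch
{
    // Hypothetical variant: every character that is not a letter or digit
    // counts as a separator, including symbols such as '+' or '$'.
    public static string SanitizeStrict(string text)
    {
        StringBuilder res = new StringBuilder(text.Length);
        bool pendingSpace = false;

        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                // Emit one space between words, but never before the first word.
                if (pendingSpace && res.Length > 0)
                {
                    res.Append(' ');
                }
                res.Append(c);
                pendingSpace = false;
            }
            else
            {
                // Remember that a separator was seen; the space is only
                // written once another word actually follows.
                pendingSpace = true;
            }
        }
        return res.ToString();
    }

    static void Main()
    {
        // Brackets make any stray leading or trailing space visible.
        Console.WriteLine("[" + SanitizeStrict("SomeText&--777 And where before.") + "]");
        Console.WriteLine("[" + SanitizeStrict("  C# costs $5+tax  ") + "]");
    }
}
```

Because the space is deferred until the next word starts, this version never produces a leading or trailing space, so no trimming step is needed.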

Conclusion

This code is a simple and effective way of removing unwanted characters. It runs in time linear in the length of the string. Regular expressions can be thought of as sledgehammers; sometimes you want a pair of pliers, like this method.
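
For comparison, the regular-expression "sledgehammer" might look something like the sketch below. The class and method names are hypothetical. Note that it is slightly broader than the loop shown earlier: it treats every run of characters that are not letters or decimal digits as a separator, symbols included.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSanitizeSketch
{
    // Matches one or more characters that are neither letters (\p{L})
    // nor decimal digits (\p{Nd}).
    static readonly Regex NonAlphanumericRun = new Regex(@"[^\p{L}\p{Nd}]+");

    // Hypothetical regex-based counterpart to the hand-written loop.
    public static string SanitizeWithRegex(string text)
    {
        // Collapse each run to a single space, then trim the ends.
        return NonAlphanumericRun.Replace(text, " ").Trim();
    }

    static void Main()
    {
        Console.WriteLine(SanitizeWithRegex("SomeText&--777 And where before."));
    }
}
```

This is shorter to write, but it pulls in the regular expression engine for a job that a single pass over the characters handles directly.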

© 2007-2008 Sam Allen. All rights reserved.