Dot Net Perls

Regex Match Use and Options - C#

by Sam Allen

Problem

Isolate part of a string based on matching patterns around it. Use regular expressions for maximum clarity and performance. You want to be restrictive and precise in what you match, making Regex an ideal approach. Take parts of input strings and separate them.

Input string                 Your required string
/Content/Some-Page.aspx      some-page
/content/alternate-1.aspx    alternate-1
/images/something.png        (none)

Solution: C#

Here we know we have a substring that comes before what we want to 'extract', and then a substring that comes after. First, we can approach this with IndexOf and LastIndexOf, but that approach is fraught with complexity and many lines of code.
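For comparison, here is a rough sketch of what the IndexOf and LastIndexOf route can look like. It is not the original code from my project: the class name StringMethodSketch and the method MatchKeyWithIndexOf are hypothetical, and the character check is my assumption about what the input should be restricted to.

```csharp
using System;

static class StringMethodSketch
{
    // Hypothetical helper: returns the text between "/content/" and ".aspx",
    // lowercased, or null when the path does not have that shape.
    public static string MatchKeyWithIndexOf(string path)
    {
        const string prefix = "/content/";
        const string suffix = ".aspx";

        if (path == null)
        {
            return null;
        }
        string lower = path.ToLowerInvariant();

        // The suffix must be at the very end of the path.
        if (!lower.EndsWith(suffix, StringComparison.Ordinal))
        {
            return null;
        }

        // Find the last "/content/" and make sure a key can follow it.
        int start = lower.LastIndexOf(prefix, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += prefix.Length;
        int length = lower.Length - suffix.Length - start;
        if (length <= 0)
        {
            return null;
        }
        string key = lower.Substring(start, length);

        // Nothing so far restricts which characters appear in the key,
        // so that has to be checked by hand as well.
        foreach (char c in key)
        {
            bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '-';
            if (!ok)
            {
                return null;
            }
        }
        return key;
    }

    static void Main()
    {
        Console.WriteLine(MatchKeyWithIndexOf("/Content/Some-Page.aspx"));       // some-page
        Console.WriteLine(MatchKeyWithIndexOf("/images/something.png") == null); // True
    }
}
```

Every early return above is a case that the Regex pattern handles in a single line, which is the complexity I wanted to avoid.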

1. Use static Regex

My first approach to this problem (after giving up on IndexOf) was simply a static Regex. However, this causes an unnecessary performance drain. The following example shows how I used the Regex.Match static method.

string path = "/content/alternate-1.aspx";

Match match = Regex.Match(path, @"content/([A-Za-z0-9\-]+)\.aspx$",
    RegexOptions.IgnoreCase);

if (match.Success)
{
    string key = match.Groups[1].Value;
}

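Note: captured groups start at index 1!

This is an annoyance to me, but the first captured group on a Match object is at Groups[1], not Groups[0]. The Groups collection itself is zero-based like other C# collections, but Groups[0] always holds the entire matched text, so the parts captured by parentheses begin at index 1. We must remember this.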
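A small sketch makes the Groups layout concrete. The class name GroupsDemo and the method GroupValues are made up for this illustration; the Regex calls are the same ones used above.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupsDemo
{
    // Hypothetical helper: returns the value of every group in the match.
    public static string[] GroupValues(string path)
    {
        Match match = Regex.Match(path, @"content/([a-z0-9\-]+)\.aspx$");
        string[] values = new string[match.Groups.Count];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = match.Groups[i].Value;
        }
        return values;
    }

    static void Main()
    {
        string[] values = GroupValues("/content/alternate-1.aspx");
        Console.WriteLine(values.Length); // 2
        Console.WriteLine(values[0]);     // content/alternate-1.aspx
        Console.WriteLine(values[1]);     // alternate-1
    }
}
```

Groups[0] is the whole match, and Groups[1] is the text captured by the first pair of parentheses.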

2. Use ToLower instead of IgnoreCase

I found that calling ToLower on the input instead of passing RegexOptions.IgnoreCase to the Regex yielded a 10% or higher improvement. Clearly, using RegexOptions.IgnoreCase is not always worthwhile, and since I needed a lowercase result anyway, calling the C# string ToLower method first was a win.

// Lowercase our input first for a performance boost.
string path = pathInput.ToLower();
Match match = Regex.Match(path, @"content/([A-Za-z0-9\-]+)\.aspx$");
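The two approaches also differ in what they return, which the following sketch tries to show. The class and method names here are hypothetical and exist only for this comparison.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaseHandlingDemo
{
    // Matches case-insensitively, so the captured key keeps its original casing.
    public static string KeyWithIgnoreCase(string path)
    {
        Match match = Regex.Match(path, @"content/([A-Za-z0-9\-]+)\.aspx$",
            RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Lowercases the input first, so the captured key is already lowercase.
    public static string KeyWithToLower(string path)
    {
        Match match = Regex.Match(path.ToLower(), @"content/([a-z0-9\-]+)\.aspx$");
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        string path = "/Content/Some-Page.aspx";
        Console.WriteLine(KeyWithIgnoreCase(path)); // Some-Page
        Console.WriteLine(KeyWithToLower(path));    // some-page
    }
}
```

With IgnoreCase you would still need a ToLower call on the result to get a lowercase key. If your paths may be processed under different cultures, ToLowerInvariant is probably the safer call, but I have not measured it here.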

3. Use Regex instance

A Regex instance is faster than the static Regex.Match method, so prefer an instance in the performance-critical places in your code. For my project, I created a static class that holds one shared Regex instance and can be used in the entire project. This version performed nearly twice as well.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Regexes for use on the site.
/// </summary>
static class RegexUtil
{
    static Regex _regex;
    static RegexUtil()
    {
        _regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$");
    }

    /// <summary>
    /// Return the key that is matched within the path.
    /// </summary>
    static public string MatchKey(string path)
    {
        Match match = _regex.Match(path.ToLower());
        if (match.Success)
        {
            return match.Groups[1].Value;
        }
        else
        {
            return null;
        }
    }
}

4. Add RegexOptions.Compiled flag

Next I modified the previous code to use the RegexOptions.Compiled flag. With this flag, the Regex is converted at run time to MSIL (intermediate language) instead of being interpreted. I am not 100% clear on the details, but the general idea is simple: constructing the Regex takes longer, and matching with it is faster. By using Compiled, I got a 30% or higher performance improvement.

// Add compiled flag for 30% boost
_regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$", RegexOptions.Compiled);

5. RightToLeft improves performance

Here I added the RegexOptions.RightToLeft flag. Even with the end-pattern matching character ($), this improved performance for my application. (Note that these strings are being matched by their ends, which makes RightToLeft a perfect tool.)

// Combine Compiled with RightToLeft
_regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$",
    RegexOptions.Compiled | RegexOptions.RightToLeft);
// (Note how the options are combined with the bitwise OR operator |.)
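
To see what RightToLeft changes, here is a minimal sketch on a toy input rather than the site's paths. The names RightToLeftDemo and FirstDigitFound are hypothetical. It also shows how a combined options value can be tested for a single flag.

```csharp
using System;
using System.Text.RegularExpressions;

static class RightToLeftDemo
{
    // Returns the first digit the engine finds under the given options.
    public static string FirstDigitFound(string input, RegexOptions options)
    {
        return Regex.Match(input, @"\d", options).Value;
    }

    static void Main()
    {
        // Left to right, the search starts at the beginning of the input.
        Console.WriteLine(FirstDigitFound("a1b2c3", RegexOptions.None));        // 1

        // Right to left, the search starts at the end of the input.
        Console.WriteLine(FirstDigitFound("a1b2c3", RegexOptions.RightToLeft)); // 3

        // Flags combined with | can be tested individually with &.
        RegexOptions options = RegexOptions.Compiled | RegexOptions.RightToLeft;
        Console.WriteLine((options & RegexOptions.RightToLeft) != 0); // True
        Console.WriteLine((options & RegexOptions.IgnoreCase) != 0);  // False
    }
}
```

Because the search begins at the end of the string, a pattern anchored at the end can be checked without first walking the whole path.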

Final version

We have seen the iterations of this Regex, and I was happy with the results. The final method is much safer, more precise, and probably easier to maintain than the original method with string methods. It may even improve performance by reducing stray exceptions.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Regexes for use on the site.
/// </summary>
static class RegexUtil
{
    static Regex _regex;
    static RegexUtil()
    {
        _regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$",
            RegexOptions.Compiled | RegexOptions.RightToLeft);
    }

    /// <summary>
    /// Return the key that is matched within the path.
    /// </summary>
    static public string MatchKey(string path)
    {
        Match match = _regex.Match(path.ToLower());
        if (match.Success)
        {
            return match.Groups[1].Value;
        }
        else
        {
            return null;
        }
    }
}

Benchmark results

The progression here shows how you can make a Regex faster and simpler. More optimized string handling could improve the plain string version further, and make it more reliable, but the Regex version is probably best.
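
If you want to check the progression on your own machine, a timing harness along these lines may help. This is only a sketch, not the benchmark I ran: the class RegexBenchmarkSketch, the TimeMs helper, and the iteration count are all assumptions, and the numbers will vary with the runtime and hardware.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class RegexBenchmarkSketch
{
    const string Pattern = @"/content/([a-z0-9\-]+)\.aspx$";

    // Hypothetical helper: runs the delegate repeatedly and returns elapsed milliseconds.
    public static long TimeMs(Func<string, bool> isMatch, string path, int iterations)
    {
        isMatch(path); // Warm-up call so one-time setup is not counted.
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            isMatch(path);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    static void Main()
    {
        string path = "/content/alternate-1.aspx";
        int iterations = 200000; // Arbitrary; raise it if the timings are too small to compare.

        Regex instance = new Regex(Pattern);
        Regex compiled = new Regex(Pattern, RegexOptions.Compiled);
        Regex compiledRtl = new Regex(Pattern,
            RegexOptions.Compiled | RegexOptions.RightToLeft);

        Console.WriteLine("1. static, IgnoreCase:  " + TimeMs(
            p => Regex.Match(p, Pattern, RegexOptions.IgnoreCase).Success,
            path, iterations) + " ms");
        Console.WriteLine("2. static, ToLower:     " + TimeMs(
            p => Regex.Match(p.ToLower(), Pattern).Success,
            path, iterations) + " ms");
        Console.WriteLine("3. instance:            " + TimeMs(
            p => instance.Match(p.ToLower()).Success,
            path, iterations) + " ms");
        Console.WriteLine("4. Compiled:            " + TimeMs(
            p => compiled.Match(p.ToLower()).Success,
            path, iterations) + " ms");
        Console.WriteLine("5. Compiled+RightToLeft: " + TimeMs(
            p => compiledRtl.Match(p.ToLower()).Success,
            path, iterations) + " ms");
    }
}
```

The warm-up call keeps Regex construction and compilation out of the measured loop, so the figures reflect matching cost only.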

Using the MatchKey method

You can call the above static class method with very simple syntax. What I show next is a snippet of calling code that will return the "key" within two substrings in the input string. This is ideal for URL rewriting on web sites.

string key = RegexUtil.MatchKey(path);
if (key != null)
{
    // key was found and is set
}

Conclusion

The Regex class is ideal for matching patterns in strings. By using Regex.Match here, I greatly simplified my code for matching substrings and made it more foolproof. This is critical for programs that can accept user input. Use this method for matching input that is between two substrings.

© 2008 Sam Allen. All rights reserved.