Regex
Programs read in text and often must process it in some way. Often the easiest way to process text is with regular expressions. The Regex class in C# helps here.
With methods like Match, we pass in a pattern and receive matches based on that pattern. We can optionally create a Regex instance first.
This program introduces the Regex class. Regex and Match are found in the System.Text.RegularExpressions namespace.
Step 1: We create a Regex. The Regex uses a pattern that indicates one or more digits.
Step 2: We call the Match method on the Regex. The characters "55" match the pattern specified in step 1.
Step 3: The returned Match object has a bool property called Success. If it equals true, we found a match.

    using System;
    using System.Text.RegularExpressions;

    // Step 1: create new Regex.
    Regex regex = new Regex(@"\d+");

    // Step 2: call Match on Regex instance.
    Match match = regex.Match("a55a");

    // Step 3: test for Success.
    if (match.Success)
    {
        Console.WriteLine("MATCH VALUE: " + match.Value);
    }

    MATCH VALUE: 55
We do not need to create a Regex instance to use Match: we can invoke the static Regex.Match. This example builds up some complexity: we access Groups after testing Success.
Part 1: This is the string we are testing. Notice how it has a file name part inside a directory name and extension.
Part 2: We use the Regex.Match static method. The second argument is the pattern we wish to match with.
Part 3: We test the result of Match with the Success property. When true, a match occurred and we can access its Value or Groups.

    using System;
    using System.Text.RegularExpressions;

    // Part 1: the input string.
    string input = "/content/alternate-1.aspx";

    // Part 2: call Regex.Match.
    Match match = Regex.Match(input, @"content/([A-Za-z0-9\-]+)\.aspx$",
        RegexOptions.IgnoreCase);

    // Part 3: check the Match for Success.
    if (match.Success)
    {
        // Part 4: get the Group value and display it.
        string key = match.Groups[1].Value;
        Console.WriteLine(key);
    }

    alternate-1
We can use metacharacters to match the start and end of strings. This is often done when using regular expressions. Use "^" to match the start, and "$" for the end.
Instead of returning a Match object like Regex.Match, IsMatch just returns a bool that indicates success. The same metacharacters work with Regex.Match, which will return any possible matches at those positions.

    using System;
    using System.Text.RegularExpressions;

    string test = "xxyy";

    // Match the start of a string.
    if (Regex.IsMatch(test, "^xx"))
    {
        Console.WriteLine("START MATCHES");
    }

    // Match the end of a string.
    if (Regex.IsMatch(test, "yy$"))
    {
        Console.WriteLine("END MATCHES");
    }

    START MATCHES
    END MATCHES
NextMatch
More than one match may be found. We can call NextMatch() to search for a match that comes after the current one in the text. NextMatch can be used in a loop.
Step 1: We call Regex.Match. Two matches occur. This call to Regex.Match returns the first Match only.
Step 2: NextMatch returns another Match object. It does not modify the current one, so we assign a variable to it.

    using System;
    using System.Text.RegularExpressions;

    string value = "4 AND 5";

    // Step 1: get first match.
    Match match = Regex.Match(value, @"\d");
    if (match.Success)
    {
        Console.WriteLine(match.Value);
    }

    // Step 2: get second match.
    match = match.NextMatch();
    if (match.Success)
    {
        Console.WriteLine(match.Value);
    }

    4
    5
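Since NextMatch can be used in a loop, here is a minimal sketch of that idea. The longer input string and the `found` list are assumptions added for illustration; they are not part of the program above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string value = "4 AND 5 AND 6";
var found = new List<string>();

// Start with the first match, then advance with NextMatch
// until a Match with Success == false is returned.
for (Match m = Regex.Match(value, @"\d"); m.Success; m = m.NextMatch())
{
    found.Add(m.Value);
    Console.WriteLine(m.Value);
}
```

The loop ends on its own because NextMatch returns an unsuccessful Match once no further matches remain.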
Replace
Sometimes we need to replace a pattern of text with some other text. Regex.Replace helps. We can replace patterns with a string, or with a value determined by a MatchEvaluator.
Here we call Replace on a Regex instance, passing the input and a replacement string. The 2 digit sequences ("123" and "456") are replaced with "bird."

    using System;
    using System.Text.RegularExpressions;

    // Replace each run of one or more digits with a string.
    Regex regex = new Regex(@"\d+");
    string result = regex.Replace("cat 123 456", "bird");
    Console.WriteLine("RESULT: {0}", result);

    RESULT: cat bird bird
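The MatchEvaluator form of Replace can be sketched as follows. The `doubler` delegate is a hypothetical example (it doubles each matched number) and is only meant to show that the replacement text can be computed from each Match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical evaluator for illustration: doubles each matched number.
MatchEvaluator doubler = m => (int.Parse(m.Value) * 2).ToString();

// The evaluator is called once per match; its return value replaces the match.
string doubled = Regex.Replace("cat 123 456", @"\d+", doubler);
Console.WriteLine("RESULT: {0}", doubled);
```

With this input, the program should print "RESULT: cat 246 912".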
Quantifiers such as "*" and "+" are greedy by default: they match as many characters as they can. Placing the "?" metacharacter after a quantifier makes it lazy (non-greedy), so it matches as few characters as possible.
    using System;
    using System.Text.RegularExpressions;

    string test = "/bird/cat/";

    // Version 1: use lazy (or non-greedy) metacharacter.
    var result1 = Regex.Match(test, "^/.*?/");
    if (result1.Success)
    {
        Console.WriteLine("NON-GREEDY: {0}", result1.Value);
    }

    // Version 2: default Regex.
    var result2 = Regex.Match(test, "^/.*/");
    if (result2.Success)
    {
        Console.WriteLine("GREEDY: {0}", result2.Value);
    }

    NON-GREEDY: /bird/
    GREEDY: /bird/cat/
Often a Regex instance object is faster than the static Regex.Match. For performance, we should usually use an instance object. It can be shared throughout an entire project.
Sometimes we only need to call Match once in a program's execution. A Regex object does not help here.
In this example, a static class stores an instance Regex that can be used project-wide. We initialize it inline.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            // The input string again.
            string input = "/content/alternate-1.aspx";

            // This calls the static method specified.
            Console.WriteLine(RegexUtil.MatchKey(input));
        }
    }

    static class RegexUtil
    {
        static Regex _regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$");

        /// <summary>
        /// This returns the key that is matched within the input.
        /// </summary>
        public static string MatchKey(string input)
        {
            Match match = _regex.Match(input.ToLower());
            if (match.Success)
            {
                return match.Groups[1].Value;
            }
            else
            {
                return null;
            }
        }
    }

    alternate-1
Match, parse numbers
A common requirement is extracting a number from a string. We can do this with Regex.Match. To get further numbers, consider Matches() or NextMatch.
The Value property gives us the string representation of that number. We can call int.Parse or int.TryParse on the Value here. This will convert it to an int.

    using System;
    using System.Text.RegularExpressions;

    string input = "Dot Net 100 Perls";
    Match match = Regex.Match(input, @"\d+");

    // Only use the number if both the match and the parse succeeded.
    if (match.Success && int.TryParse(match.Value, out int number))
    {
        // Show that we have the numbers.
        Console.WriteLine("NUMBERS: {0}, {1}", number, number + 1);
    }

    NUMBERS: 100, 101
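To get further numbers with Matches(), something like the following sketch could be used. The input string and the `numbers` list are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "Dot 10 Net 200 Perls 3000";
var numbers = new List<int>();

// Regex.Matches returns every non-overlapping match in the input.
foreach (Match m in Regex.Matches(input, @"\d+"))
{
    // TryParse guards against digit runs that do not fit in an int.
    if (int.TryParse(m.Value, out int n))
    {
        numbers.Add(n);
    }
}

Console.WriteLine(string.Join(", ", numbers));
```

For this input the list should hold 10, 200 and 3000, in the order they appear in the text.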
A Match object, returned by Regex.Match, has a Value, Length and Index. These describe the matched text (a substring of the input).
Value is the matched text as a string. This is a substring of the original input.
Length is the length of the Value string. Here, the Length of "AXXXXY" is 6.
Index is the index where the matched text begins within the input string. The character "A" starts at index 4 here.

    using System;
    using System.Text.RegularExpressions;

    Match m = Regex.Match("123 AXXXXY", @"A.*Y");
    if (m.Success)
    {
        Console.WriteLine($"Value = {m.Value}");
        Console.WriteLine($"Length = {m.Length}");
        Console.WriteLine($"Index = {m.Index}");
    }

    Value = AXXXXY
    Length = 6
    Index = 4
IsMatch
This method tests for a matching pattern. It does not capture groups from this pattern. It just sees if the pattern exists in a valid form in the input string.
IsMatch returns a bool value. Both overloads receive an input string that is searched for matches.
When we use the static Regex.IsMatch method, a Regex is created internally (or reused from a small cache of recently used patterns). It then works in the same way as any instance Regex.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        /// <summary>
        /// Test string using Regex.IsMatch static method.
        /// </summary>
        static bool IsValid(string value)
        {
            return Regex.IsMatch(value, @"^[a-zA-Z0-9]*$");
        }

        static void Main()
        {
            // Test the strings with the IsValid method.
            Console.WriteLine(IsValid("dotnetperls0123"));
            Console.WriteLine(IsValid("DotNetPerls"));
            Console.WriteLine(IsValid(":-)"));
            // Console.WriteLine(IsValid(null)); // Throws an exception
        }
    }

    True
    True
    False
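For comparison, the instance overload of IsMatch takes only the input, because the pattern is stored in the Regex object. This is a small sketch; the variable names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern is supplied once, when the Regex is constructed.
Regex alnum = new Regex(@"^[a-zA-Z0-9]*$");

// The instance IsMatch overload then needs only the input string.
bool first = alnum.IsMatch("DotNetPerls");
bool second = alnum.IsMatch(":-)");

Console.WriteLine(first);
Console.WriteLine(second);
```

This should print True and then False, matching the results of the static version for the same inputs.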
RegexOptions
With the Regex type, the RegexOptions enum is used to modify method behavior. Often I find the IgnoreCase value helpful.
Lowercase and uppercase letters are distinct in the Regex text language. IgnoreCase changes this.
We can also change how the Regex type acts upon newlines with the RegexOptions enum. This is often useful.

    using System;
    using System.Text.RegularExpressions;

    const string value = "TEST";

    // ... This ignores the case of the "T" character.
    if (Regex.IsMatch(value, "t...", RegexOptions.IgnoreCase))
    {
        Console.WriteLine(true);
    }

    True
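As one illustration of how RegexOptions affects newline handling, the sketch below uses RegexOptions.Multiline, which makes "^" and "$" match at line boundaries as well as at the ends of the whole input. The two-line input string is an assumption for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "cat\ndog";

// Without Multiline, "^" only matches at the very start of the input.
bool plain = Regex.IsMatch(text, "^dog");

// With Multiline, "^" also matches right after each newline.
bool multi = Regex.IsMatch(text, "^dog", RegexOptions.Multiline);

Console.WriteLine(plain);
Console.WriteLine(multi);
```

The first call should report False and the second True, since "dog" starts the second line but not the input.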
Regex
Consider the performance of Regex.Match. If we use the RegexOptions.Compiled enum, and use a cached Regex object, we can get a performance boost.
Version 1: This code uses the static Regex.Match method, without any object caching.
Version 2: This code calls Match() on a cached instance of the Regex.
Result: By using a static field Regex, and RegexOptions.Compiled, our method completes about twice as fast (tested on .NET 5 for Linux).
Warning: A compiled Regex will cause a program to start up slower, and may use more memory, so only compile hot Regexes.

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    class Program
    {
        static int Version1()
        {
            string value = "This is a simple 5string5 for Regex.";
            return Regex.Match(value, @"5\w+5").Length;
        }

        static Regex _wordRegex = new Regex(@"5\w+5", RegexOptions.Compiled);

        static int Version2()
        {
            string value = "This is a simple 5string5 for Regex.";
            return _wordRegex.Match(value).Length;
        }

        const int _max = 1000000;

        static void Main()
        {
            // Version 1: use Regex.Match.
            var s1 = Stopwatch.StartNew();
            for (int i = 0; i < _max; i++)
            {
                if (Version1() != 8)
                {
                    return;
                }
            }
            s1.Stop();

            // Version 2: use Regex.Match, compiled Regex, instance Regex.
            var s2 = Stopwatch.StartNew();
            for (int i = 0; i < _max; i++)
            {
                if (Version2() != 8)
                {
                    return;
                }
            }
            s2.Stop();

            // Report the average time per call in nanoseconds.
            Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
            Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
        }
    }

    265.90 ns    Regex.Match
    138.78 ns    instance Regex.Match, Compiled
Regex and loop
Regular expressions can be reimplemented with loops. For example, a loop can make sure that a string only contains a certain range of characters.
In this test, the string must only contain the characters "a" through "z" lowercase and uppercase, and the ten digits "0" through "9."
Version 1: This method uses Regex.IsMatch to tell whether the string only has the range of characters specified.
Version 2: This method uses a for-loop to iterate through the character indexes in the string. It employs a switch on the char.
Result: The for-loop version is far faster here (about 10 ns versus 266 ns per iteration), even though Regex performance has been improved.

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    class Program
    {
        static bool IsValid1(string path)
        {
            return Regex.IsMatch(path, @"^[a-zA-Z0-9]*$");
        }

        static bool IsValid2(string path)
        {
            for (int i = 0; i < path.Length; i++)
            {
                switch (path[i])
                {
                    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
                    case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
                    case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
                    case 'v': case 'w': case 'x': case 'y': case 'z':
                    case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
                    case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
                    case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
                    case 'V': case 'W': case 'X': case 'Y': case 'Z':
                    case '0': case '1': case '2': case '3': case '4':
                    case '5': case '6': case '7': case '8': case '9':
                        {
                            continue;
                        }
                    default:
                        {
                            return false;
                        }
                }
            }
            return true;
        }

        const int _max = 1000000;

        static void Main()
        {
            // Version 1: use Regex.
            var s1 = Stopwatch.StartNew();
            for (int i = 0; i < _max; i++)
            {
                if (IsValid1("hello") == false || IsValid1("$bye") == true)
                {
                    return;
                }
            }
            s1.Stop();

            // Version 2: use for-loop.
            var s2 = Stopwatch.StartNew();
            for (int i = 0; i < _max; i++)
            {
                if (IsValid2("hello") == false || IsValid2("$bye") == true)
                {
                    return;
                }
            }
            s2.Stop();

            // Report the average time per iteration in nanoseconds.
            Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
            Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
        }
    }

    265.71 ns    Regex.IsMatch
    10.15 ns     for, switch
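The loop in IsValid2 could also be written more compactly with range comparisons. The sketch below uses a hypothetical helper named IsValid3, which is not part of the benchmark above and has not been timed.

```csharp
using System;

// Hypothetical helper (not from the benchmark): same rule as IsValid2,
// written with range comparisons instead of a 62-case switch.
static bool IsValid3(string path)
{
    foreach (char c in path)
    {
        bool ok = (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
        if (!ok)
        {
            return false;
        }
    }
    return true;
}

Console.WriteLine(IsValid3("hello"));
Console.WriteLine(IsValid3("$bye"));
Console.WriteLine(IsValid3(""));
```

As with IsValid2, an empty string passes, since it contains no invalid characters.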
Regular expressions are a concise way to process text data. We use Regex.Matches, and IsMatch, to check a pattern (evaluating its metacharacters) against an input string.