Regex
In Java regexes, we match strings to patterns. We can match the string "cat" with the pattern "c.t". We use things like Pattern.compile and Matcher.
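As a quick illustration of the idea above, this minimal sketch tests a few strings against the pattern "c.t". The class name DotExample and the helper matchesCdotT are names chosen here for illustration only.

```java
import java.util.regex.Pattern;

class DotExample {
    // Hypothetical helper: true if the whole string matches "c.t".
    static boolean matchesCdotT(String input) {
        // The dot matches any single character (except a line terminator).
        return Pattern.matches("c.t", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesCdotT("cat"));  // true
        System.out.println(matchesCdotT("cut"));  // true
        System.out.println(matchesCdotT("cart")); // false: one character too many
    }
}
```

Pattern.matches tests the entire string, so "cart" fails even though it contains "c" and "t".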
Regex performance
Regular expressions often reduce performance compared with hand-written string code. The pattern language must be parsed and interpreted, and this has a cost. But in many places, regular expressions are an overall improvement.
Pattern.matches example
We call Pattern.matches in a loop. Its first argument is the regular expression's pattern. It also accepts the string we want to test for matches.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Some strings to test.
        String[] inputs = { "dog", "dance", "cat", "dirt" };

        // Loop over strings and test them.
        for (String input : inputs) {
            boolean b = Pattern.matches("d.+", input);
            System.out.println(b);
        }
    }
}

true
true
false
true
Pattern.compile and Matcher
Next we learn a faster way to match regular expressions. We use Pattern.compile to create a compiled pattern object. Then we call the matcher() method on the pattern instance. This returns a Matcher class instance.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Compile this pattern.
        Pattern pattern = Pattern.compile("num\\d\\d\\d");

        // See if this String matches.
        Matcher m = pattern.matcher("num123");
        if (m.matches()) {
            System.out.println(true);
        }

        // Check this String.
        m = pattern.matcher("num456");
        if (m.matches()) {
            System.out.println(true);
        }
    }
}

true
true
Often regular expression patterns use groups to capture parts of strings. Here we use positional groups. We access them by their position (1, 2 or more).
After matches() returns true, we access groups.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        Pattern pattern = Pattern.compile("(\\d+)\\-(\\d+)");

        // Get matcher on this String.
        Matcher m = pattern.matcher("1234-5678");

        // If it matches, get and display group values.
        if (m.matches()) {
            String part1 = m.group(1);
            String part2 = m.group(2);
            System.out.println(part1);
            System.out.println(part2);
        }
    }
}

1234
5678
With names, we easily access specific groups from a matched pattern. We use angle brackets to name groups in the pattern. Then we call group() with a String name argument.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Specify a pattern with named groups.
        Pattern pattern = Pattern.compile("(?<first>..)x(?<second>..)");
        Matcher m = pattern.matcher("c3xp0");

        // Check for matches.
        // ... Then access named groups by their names.
        if (m.matches()) {
            String part1 = m.group("first");
            String part2 = m.group("second");
            System.out.println(part1);
            System.out.println(part2);
        }
    }
}

c3
p0
Pattern.quote
Characters must be escaped ("quoted") to avoid being seen as metacharacters. For example a star must be escaped to mean an asterisk, not a Kleene closure of "zero or more."
Pattern.quote surrounds the String with \Q and \E. Between these markers, everything is escaped. Without Pattern.quote, we receive a "dangling metacharacter" exception.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Quote this value.
        String value = "*star";
        String quote = Pattern.quote(value);
        System.out.println(value);
        System.out.println(quote);

        // Try matching with quoted value.
        boolean result1 = Pattern.matches(quote, "*star");
        System.out.println(result1);

        // This fails because it was not quoted.
        boolean result2 = Pattern.matches(value, "*star");
        System.out.println(result2);
    }
}

*star
\Q*star\E
true
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*star
^
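Quoting is often combined with ordinary regex syntax. The sketch below assumes we want to match a literal prefix that happens to contain a metacharacter, followed by digits. The class QuoteExample and the helper literalThenDigits are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class QuoteExample {
    // Hypothetical helper: builds a pattern that matches the literal text
    // exactly, followed by one or more digits.
    static Pattern literalThenDigits(String literal) {
        // Only the literal part is quoted; "\\d+" keeps its regex meaning.
        return Pattern.compile(Pattern.quote(literal) + "\\d+");
    }

    public static void main(String[] args) {
        Pattern p = literalThenDigits("a+b=");

        // The plus is treated as a plain character, not "one or more".
        System.out.println(p.matcher("a+b=42").matches()); // true
        System.out.println(p.matcher("aab=42").matches()); // false
    }
}
```

Without the quote call, "a+b=" would mean "one or more a, then b=", and the second string would match.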
Often in regular expressions we want to match the start or end of strings. Two metacharacters are useful here: "^" and "$". These match the start and the end.
The method startsWithAEndsWithZ tests a String. It returns true if the first char is "a" and the last is "z."
Using String methods (startsWith, endsWith, charAt) is more efficient. But it becomes harder to code when requirements change.

import java.util.regex.Pattern;

public class Program {

    public static boolean startsWithAEndsWithZ(String value) {
        // Test start and end characters.
        return Pattern.matches("^a.*z$", value);
    }

    public static void main(String[] args) {

        String[] values = { "a123z", "b123z", "az", "aq", "aza" };

        // Loop over and test these Strings.
        for (String value : values) {
            System.out.print(value);
            System.out.print(' ');
            System.out.println(startsWithAEndsWithZ(value));
        }
    }
}

a123z true
b123z false
az true
aq false
aza false
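For comparison, here is a minimal sketch of the same test written with String methods instead of a pattern. The class name StartEnd is chosen here for illustration. One assumption to note: the two versions agree for single-line text, but the regex dot does not match line terminators, so they can differ on strings that contain newlines.

```java
class StartEnd {
    // Same check as the regex version, using only String methods.
    static boolean startsWithAEndsWithZ(String value) {
        return value.startsWith("a") && value.endsWith("z");
    }

    public static void main(String[] args) {
        String[] values = { "a123z", "b123z", "az", "aq", "aza" };

        // Loop over and test these Strings.
        for (String value : values) {
            System.out.println(value + " " + startsWithAEndsWithZ(value));
        }
    }
}
```

For the five test strings this prints the same true and false results as the regex version.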
Split, Pattern
A split method is available on Pattern instances. This lets us split based on a Regex delimiter. The Pattern can be compiled once and reused many times.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        String line = "cat, dog, rabbit--100";

        // Compile a Pattern that indicates a delimiter.
        Pattern p = Pattern.compile("\\W+");

        // Split a String based on the delimiter pattern.
        String[] elements = p.split(line);
        for (String element : elements) {
            System.out.println(element);
        }
    }
}

cat
dog
rabbit
100
Pattern.COMMENTS
With this flag we can use comments in a regular expression pattern. This can help make larger regular expressions easier to read and maintain.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {

    final static String example = "#Match line string\n" + "line\\W"
            + "#Match one or more digits and a separator\n" + "\\d+\\W+"
            + "#Match one or more word chars\n" + "\\w+";

    public static void main(String[] args) {

        // Compile this pattern with COMMENTS.
        // ... Whitespace is ignored and comments are allowed.
        Pattern pattern = Pattern.compile(example, Pattern.COMMENTS);

        // This line will succeed.
        Matcher m = pattern.matcher("line 123: BIRD");
        if (m.matches()) {
            System.out.println(m.toString());
        }

        // This will not succeed.
        m = pattern.matcher("test failure");
        if (m.matches()) {
            System.out.println(false); // Not reached.
        }
    }
}

java.util.regex.Matcher[pattern=#Match line string
line\W#Match one or more digits and a separator
\d+\W+#Match one or more word chars
\w+ region=0,14 lastmatch=line 123: BIRD]
Matches
This method receives a Regex string. If the pattern we supply matches the string we call matches() on, we get a true result. Otherwise it returns false.
Calling matches() on a String is the same as calling Pattern.matches directly. But this syntax may be easier to use in programs.

public class Program {
    public static void main(String[] args) {

        String value = "carrots";

        // This regular expression matches.
        boolean result1 = value.matches("c.*s");
        System.out.println(result1);

        // This regular expression does not match.
        boolean result2 = value.matches("c.*x");
        System.out.println(result2);
    }
}

true
false
With compile() we reuse the same pattern many times. Does this give us a significant performance advantage over just calling matches?
Version 1: This version uses Pattern.compile (and the matches method).
Version 2: Here we call Pattern.matches, but do not use the compile method. This is a non-compiled Regex.
Result: Using compile() and a Matcher is a clear performance boost. This approach is faster than Pattern.matches, roughly three times faster in this run.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) throws Exception {

        // ... Compile.
        Pattern pattern = Pattern.compile("num\\d\\d\\d");

        long t1 = System.currentTimeMillis();

        // Version 1: use Matcher with compiled pattern.
        for (int i = 0; i < 100000; i++) {
            Matcher m = pattern.matcher("num123");
            if (!m.matches()) {
                throw new Exception();
            }
        }

        long t2 = System.currentTimeMillis();

        // Version 2: use Pattern.matches method.
        for (int i = 0; i < 100000; i++) {
            if (!Pattern.matches("num\\d\\d\\d", "num123")) {
                throw new Exception();
            }
        }

        long t3 = System.currentTimeMillis();

        // ... Times.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}

31 ms, Pattern.compile, Matcher
90 ms, Pattern.matches
We can reference groups with names or indexes using the group method on Matcher. Is it faster to access a group by an index, instead of a name?
Version 1: This version accesses groups by name, with a string like "digitpart."
Version 2: This version accesses groups by index (1 and 2).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // ... Compile.
        // ... The second group uses \w+ so that "cat" is matched.
        Pattern pattern1 = Pattern
                .compile("(?<digitpart>\\d\\d),(?<letterpart>\\w+)");
        Pattern pattern2 = Pattern.compile("(\\d\\d),(\\w+)");

        long t1 = System.currentTimeMillis();

        // Version 1: use pattern with named groups.
        for (int i = 0; i < 200000; i++) {
            Matcher m = pattern1.matcher("34,cat");
            if (m.matches()) {
                String part1 = m.group("digitpart");
                String part2 = m.group("letterpart");
                // Compare contents with equals (not references).
                if (!part1.equals("34") || !part2.equals("cat")) {
                    System.out.println(false);
                    break;
                }
            }
        }

        long t2 = System.currentTimeMillis();

        // Version 2: use pattern with indexed (ordinal) groups.
        for (int i = 0; i < 200000; i++) {
            Matcher m = pattern2.matcher("34,cat");
            if (m.matches()) {
                String part1 = m.group(1);
                String part2 = m.group(2);
                if (!part1.equals("34") || !part2.equals("cat")) {
                    System.out.println(false);
                    break;
                }
            }
        }

        long t3 = System.currentTimeMillis();

        // ... Times.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}

44 ms, group(name)
21 ms, group(index)
A regular expression can be used to count words. The split() method is helpful here. But a faster option is to use a for-loop.
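Both approaches can be sketched as follows. This is a minimal illustration, not a tuned implementation. It assumes a "word" is a run of ASCII letters, digits or underscores (the default meaning of \w), and the class and method names are hypothetical.

```java
class WordCount {
    // Count words by splitting on runs of non-word characters.
    static int countWithSplit(String text) {
        int count = 0;
        for (String token : text.split("\\W+")) {
            // A leading delimiter yields an empty first token; skip it.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Count words with a for-loop: a word starts wherever a word
    // character follows a non-word character or the start of the text.
    static int countWithLoop(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean wordChar = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_';
            if (wordChar && !inWord) {
                count++;
            }
            inWord = wordChar;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "To be, or not to be";
        System.out.println(countWithSplit(text)); // 6
        System.out.println(countWithLoop(text));  // 6
    }
}
```

The loop version avoids allocating an array of substrings, which is one reason it can be faster.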
HTML is perhaps the most used document format in the world. With a Regex we can manipulate simple HTML tags. But this is not a general-purpose solution: a regular expression cannot reliably handle nested or malformed markup.
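As one example of manipulating simple tags, the sketch below removes them from a short string. It assumes well-formed input where every "<" starts a tag and no tag contains a ">" inside an attribute value. The names StripTags and stripTags are chosen for this illustration.

```java
import java.util.regex.Pattern;

class StripTags {
    // Compiled once and reused: "<", then any characters except ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: removes simple tags and keeps the text between them.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world
    }
}
```

For comments, scripts or attribute values that contain angle brackets, an HTML parser is the safer choice.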
When complexity builds, writing custom loops becomes a challenge. With Regex we simplify programs. We make them easier to write and understand.