
`Regex`

In Java, regular expressions match strings against patterns. For example, the pattern "c.t" matches the string "cat". We use classes like Pattern and Matcher, and methods like Pattern.compile.

Regular expressions often reduce performance, because the pattern language has to be compiled and interpreted at runtime. But in many places, regular expressions are an overall improvement.

`Pattern.matches` example

We call Pattern.matches in a loop. Its first argument is the regular expression's pattern. It also accepts the string we want to test for matches.

And it returns a boolean. If the entire string matches the pattern, this value equals true. To access groups, we need to use a Matcher instead.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Some strings to test.
        String[] inputs = { "dog", "dance", "cat", "dirt" };

        // Loop over strings and test them.
        for (String input : inputs) {
            boolean b = Pattern.matches("d.+", input);
            System.out.println(b);
        }
    }
}

true
true
false
true
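
Pattern.matches succeeds only when the whole input matches the pattern. As a small sketch (the class name MatchVsFind and its helper methods are made up for illustration), the following contrasts it with Matcher.find, which accepts a match anywhere in the input.

```java
import java.util.regex.Pattern;

class MatchVsFind {

    // True only if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if any part of the input matches the regex.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "d.+" must cover the whole string for matches().
        System.out.println(wholeMatch("d.+", "dog"));      // true
        System.out.println(wholeMatch("d.+", "a dog"));    // false

        // find() accepts a match that starts later in the string.
        System.out.println(partialMatch("d.+", "a dog"));  // true
    }
}
```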

`Pattern.compile` and Matcher

Next we learn a faster way to match regular expressions. We use Pattern.compile to create a compiled pattern object.

Then we call the matcher() method on the pattern instance. This returns a Matcher class instance.

Detail: Finally the matches method is used. This returns true if the matcher has a match of the compiled pattern.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Compile this pattern.
        Pattern pattern = Pattern.compile("num\\d\\d\\d");

        // See if this String matches.
        Matcher m = pattern.matcher("num123");
        if (m.matches()) {
            System.out.println(true);
        }

        // Check this String.
        m = pattern.matcher("num456");
        if (m.matches()) {
            System.out.println(true);
        }
    }
}

true
true
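
One common way to reuse a compiled pattern is to keep it in a static final field. The sketch below assumes that approach; the names CompiledReuse, NUM_CODE, isNumCode and countMatches are hypothetical, chosen only for this example.

```java
import java.util.regex.Pattern;

class CompiledReuse {

    // Compiled once when the class loads, then shared by every call.
    private static final Pattern NUM_CODE = Pattern.compile("num\\d\\d\\d");

    // Hypothetical helper: true if the value is "num" plus three digits.
    static boolean isNumCode(String value) {
        return NUM_CODE.matcher(value).matches();
    }

    // Count how many inputs match, reusing the same Pattern each time.
    static int countMatches(String[] values) {
        int count = 0;
        for (String value : values) {
            if (isNumCode(value)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] values = { "num123", "num45", "num456", "number" };
        System.out.println(countMatches(values)); // 2
    }
}
```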

Capturing groups

Often regular expression patterns use groups to capture parts of strings. Here we use positional groups. We access them by their position (1, 2 or more).

Tip: We create the compiled Pattern and initialize the Matcher as usual. After calling matches(), we access groups.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        Pattern pattern = Pattern.compile("(\\d+)\\-(\\d+)");

        // Get matcher on this String.
        Matcher m = pattern.matcher("1234-5678");

        // If it matches, get and display group values.
        if (m.matches()) {
            String part1 = m.group(1);
            String part2 = m.group(2);

            System.out.println(part1);
            System.out.println(part2);
        }
    }
}

1234
5678
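
Positional groups can also be read in a loop when the input holds several matches. This is a minimal sketch using Matcher.find; the GroupLoop class and its pairs method are illustrative names, not part of the original example. It also shows groupCount() and group(0), which returns the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLoop {

    // Collect "left:right" for every digits-digits pair found in the text.
    static List<String> pairs(String text) {
        Pattern pattern = Pattern.compile("(\\d+)-(\\d+)");
        Matcher m = pattern.matcher(text);
        List<String> result = new ArrayList<>();
        // find() advances to the next match each time it returns true.
        while (m.find()) {
            result.add(m.group(1) + ":" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+)-(\\d+)");
        Matcher m = pattern.matcher("1234-5678");
        if (m.matches()) {
            System.out.println(m.groupCount()); // 2
            System.out.println(m.group(0));     // 1234-5678
        }
        System.out.println(pairs("10-20, 30-40")); // [10:20, 30:40]
    }
}
```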

Named groups

With names, we easily access specific groups from a matched pattern. We use angle brackets to name groups in the pattern. Then we call group() with a String name argument.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Specify a pattern with named groups.
        Pattern pattern = Pattern.compile("(?<first>..)x(?<second>..)");
        Matcher m = pattern.matcher("c3xp0");

        // Check for matches.
        // ... Then access named groups by their names.
        if (m.matches()) {
            String part1 = m.group("first");
            String part2 = m.group("second");

            System.out.println(part1);
            System.out.println(part2);
        }
    }
}

c3
p0
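
Named groups can also be referenced in a replacement string with the ${name} syntax. The following sketch reuses the pattern from the example above; the NamedReplace class and its swap method are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class NamedReplace {

    private static final Pattern PAIR =
            Pattern.compile("(?<first>..)x(?<second>..)");

    // Swap the two named parts around each "x" that the pattern finds.
    static String swap(String value) {
        return PAIR.matcher(value).replaceAll("${second}x${first}");
    }

    public static void main(String[] args) {
        System.out.println(swap("c3xp0")); // p0xc3
    }
}
```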

`Pattern.quote`

Characters must be escaped ("quoted") to avoid being seen as metacharacters. For example a star must be escaped to mean an asterisk, not a Kleene closure of "zero or more."

Detail: This method surrounds a String with \Q and \E. Between these markers, every character is treated as a literal.

So we match the star as a star. Without Pattern.quote, we receive a "dangling metacharacter" exception.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // Quote this value.
        String value = "*star";
        String quote = Pattern.quote(value);

        System.out.println(value);
        System.out.println(quote);

        // Try matching with quoted value.
        boolean result1 = Pattern.matches(quote, "*star");
        System.out.println(result1);

        // This fails because it was not quoted.
        boolean result2 = Pattern.matches(value, "*star");
        System.out.println(result2);
    }
}

*star
\Q*star\E
true
Exception in thread "main" java.util.regex.PatternSyntaxException:
    Dangling meta character '*' near index 0
    *star
    ^
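
Pattern.quote is also handy when a delimiter comes from a variable and may contain metacharacters. A minimal sketch, assuming a hypothetical splitLiteral helper in a class named QuoteSplit:

```java
import java.util.regex.Pattern;

class QuoteSplit {

    // Split on a literal delimiter, whatever characters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Unquoted, "." would match any character.
        String[] parts = splitLiteral("a.b.c", ".");
        System.out.println(parts.length); // 3

        // Unquoted, "*" would throw a PatternSyntaxException.
        System.out.println(String.join("|", splitLiteral("1*2*3", "*"))); // 1|2|3
    }
}
```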

Start, end in pattern

Often in regular expressions we want to match the start or end of strings. Two metacharacters are useful here: "^" matches the start and "$" matches the end.

Here, a method called startsWithAEndsWithZ tests a String. It returns true if the first char is "a" and the last is "z".

Warning: Testing chars directly (with startsWith, endsWith, charAt) is more efficient. But that code becomes harder to maintain when requirements change.

import java.util.regex.Pattern;

public class Program {

    public static boolean startsWithAEndsWithZ(String value) {
        // Test start and end characters.
        return Pattern.matches("^a.*z$", value);
    }

    public static void main(String[] args) {
        String[] values = { "a123z", "b123z", "az", "aq", "aza" };
        // Loop over and test these Strings.
        for (String value : values) {
            System.out.print(value);
            System.out.print(' ');
            System.out.println(startsWithAEndsWithZ(value));
        }
    }
}

a123z true
b123z false
az true
aq false
aza false
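
For comparison, the same test can be written without a regular expression. This sketch (the class name StartEndCheck is made up) should give the same results as the regex version for the single-line inputs used above; inputs containing line breaks may behave differently, since "." does not match line terminators by default.

```java
class StartEndCheck {

    // Test the first and last characters directly, with no pattern.
    static boolean startsWithAEndsWithZ(String value) {
        return value.startsWith("a") && value.endsWith("z");
    }

    public static void main(String[] args) {
        String[] values = { "a123z", "b123z", "az", "aq", "aza" };
        for (String value : values) {
            System.out.println(value + " " + startsWithAEndsWithZ(value));
        }
    }
}
```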

`Split`, Pattern

A split method is available on Pattern instances. This lets us split based on a Regex delimiter. The Pattern can be compiled once and reused many times.

import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        String line = "cat, dog, rabbit--100";

        // Compile a Pattern that indicates a delimiter.
        Pattern p = Pattern.compile("\\W+");

        // Split a String based on the delimiter pattern.
        String[] elements = p.split(line);
        for (String element : elements) {
            System.out.println(element);
        }
    }
}

cat
dog
rabbit
100
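
Pattern.split also has an overload that takes a limit on the number of parts. As a rough sketch (SplitLimit, DELIMITER and firstAndRest are illustrative names), a limit of 2 separates the first element from the rest of the line.

```java
import java.util.regex.Pattern;

class SplitLimit {

    // Compiled once and reused for every line.
    private static final Pattern DELIMITER = Pattern.compile("\\W+");

    // Return at most two parts: the first element and the remainder.
    static String[] firstAndRest(String line) {
        return DELIMITER.split(line, 2);
    }

    public static void main(String[] args) {
        String[] parts = firstAndRest("cat, dog, rabbit--100");
        System.out.println(parts[0]); // cat
        System.out.println(parts[1]); // dog, rabbit--100
    }
}
```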

`Pattern.COMMENTS`

With this flag we can use comments in a regular expression pattern. This can help make larger regular expressions easier to read and maintain.

Tip: Comments must start with a pound sign (hash) and end with a newline.

Tip 2: In comments mode, whitespace in the pattern is ignored. So we match spaces in the input with "\W", which matches any non-word character.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {

    final static String example =
            "#Match line string\n" +
            "line\\W" +
            "#Match one or more digits and a separator\n" +
            "\\d+\\W+" +
            "#Match one or more word chars\n" +
            "\\w+";

    public static void main(String[] args) {

        // Compile this pattern with COMMENTS.
        // ... Whitespace is ignored and comments are allowed.
        Pattern pattern = Pattern.compile(example, Pattern.COMMENTS);

        // This line will succeed.
        Matcher m = pattern.matcher("line 123: BIRD");
        if (m.matches()) {
            System.out.println(m.toString());
        }

        // This will not succeed.
        m = pattern.matcher("test failure");
        if (m.matches()) {
            System.out.println(false); // Not reached.
        }
    }
}

java.util.regex.Matcher[pattern=#Match line string
line\W#Match one or more digits and a separator
\d+\W+#Match one or more word chars
\w+ region=0,14 lastmatch=line 123: BIRD]
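
Comments mode combines well with capturing groups, because each group can be documented on its own line. The sketch below is an assumed example, not taken from the program above; the CommentedPattern class, its REGEX constant and the suffix helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPattern {

    // Spaces before each "#" are ignored in COMMENTS mode.
    static final String REGEX =
            "(\\d{3})  # three-digit prefix\n" +
            "-         # literal hyphen\n" +
            "(\\d{4})  # four-digit suffix\n";

    static final Pattern PATTERN = Pattern.compile(REGEX, Pattern.COMMENTS);

    // Return the second group, or null if the value does not match.
    static String suffix(String value) {
        Matcher m = PATTERN.matcher(value);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(suffix("555-1234")); // 1234
        System.out.println(suffix("555 1234")); // null
    }
}
```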

`Matches`

This String method receives a regular expression string. If the pattern we supply matches the entire string we call matches() on, we get a true result. Otherwise it returns false.

Note: String.matches() gives the same result as calling Pattern.matches directly. But this syntax may be easier to use in programs.

public class Program {
    public static void main(String[] args) {

        String value = "carrots";

        // This regular expression matches.
        boolean result1 = value.matches("c.*s");
        System.out.println(result1);

        // This regular expression does not match.
        boolean result2 = value.matches("c.*x");
        System.out.println(result2);
    }
}

true
false

Benchmark, compile

With compile() we reuse the same pattern many times. Does this give us a significant performance advantage over just calling matches?

Version 1: This version of the code measures the performance of using Pattern.compile (and the matches method).

Version 2: Here we call Pattern.matches, but do not use the compile method. The pattern is compiled again on every call.

Result: Using compile() and a Matcher is a clear performance boost. This approach is faster than Pattern.matches.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) throws Exception {

        // ... Compile.
        Pattern pattern = Pattern.compile("num\\d\\d\\d");

        long t1 = System.currentTimeMillis();

        // Version 1: use Matcher with compiled pattern.
        for (int i = 0; i < 100000; i++) {
            Matcher m = pattern.matcher("num123");
            if (!m.matches()) {
                throw new Exception();
            }
        }

        long t2 = System.currentTimeMillis();

        // Version 2: use Pattern.matches method.
        for (int i = 0; i < 100000; i++) {
            if (!Pattern.matches("num\\d\\d\\d", "num123")) {
                throw new Exception();
            }
        }

        long t3 = System.currentTimeMillis();

        // ... Times.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}

31 ms, Pattern.compile, Matcher
90 ms, Pattern.matches

Benchmark, named groups

We can reference groups with names or indexes using the group method on Matcher. Is it faster to access a group by an index, instead of a name?

Version 1: This version of the code accesses groups by their names. We access a group with a string like "digitpart".

Version 2: Here we access groups by an index like 1 or 2. The returned values of using indexes are the same as named groups.

Result: In this test, named accesses are slower. Using indexes, like 1 or 2, is faster. For speed, it is better to use indexes.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {
    public static void main(String[] args) {

        // ... Compile.
        Pattern pattern1 = Pattern
                .compile("(?<digitpart>\\d\\d),(?<letterpart>\\w+)");
        Pattern pattern2 = Pattern.compile("(\\d\\d),(\\w+)");

        long t1 = System.currentTimeMillis();

        // Version 1: use pattern with named groups.
        for (int i = 0; i < 200000; i++) {
            Matcher m = pattern1.matcher("34,cat");
            if (m.matches()) {
                String part1 = m.group("digitpart");
                String part2 = m.group("letterpart");
                // Compare contents with equals(), not references.
                if (!part1.equals("34") || !part2.equals("cat")) {
                    System.out.println(false);
                    break;
                }
            }
        }

        long t2 = System.currentTimeMillis();

        // Version 2: use pattern with indexed (ordinal) groups.
        for (int i = 0; i < 200000; i++) {
            Matcher m = pattern2.matcher("34,cat");
            if (m.matches()) {
                String part1 = m.group(1);
                String part2 = m.group(2);
                // Compare contents with equals(), not references.
                if (!part1.equals("34") || !part2.equals("cat")) {
                    System.out.println(false);
                    break;
                }
            }
        }

        long t3 = System.currentTimeMillis();

        // ... Times.
        System.out.println(t2 - t1);
        System.out.println(t3 - t2);
    }
}

44 ms, group(name)
21 ms, group(index)

Word count

A regular expression can be used to count words. The split() method is helpful here. But a faster option is to use a for-loop.
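
Both approaches can be sketched as follows. This is an illustrative version that treats any run of non-whitespace characters as a word; the WordCount class and its two method names are assumptions for this example.

```java
class WordCount {

    // Count words by splitting on runs of whitespace.
    static int countWithSplit(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Count words with a for-loop: each word start adds one.
    static int countWithLoop(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                inWord = false;
            } else if (!inWord) {
                inWord = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the quick  brown fox";
        System.out.println(countWithSplit(text)); // 4
        System.out.println(countWithLoop(text));  // 4
    }
}
```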

HTML

This is perhaps the most used document format in the world. With a Regex we can manipulate simple HTML tags. But this is not a general-purpose solution.
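
As a minimal sketch of the simple case, the pattern below removes anything that looks like a tag. It assumes well-formed tags with no ">" inside attribute values, so it is not a substitute for an HTML parser; the StripTags class and stripTags method are hypothetical names.

```java
import java.util.regex.Pattern;

class StripTags {

    // A "<", then any characters except ">", then a ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Remove every simple tag and keep the text between tags.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world
    }
}
```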

When complexity builds, writing custom loops becomes a challenge. With Regex we simplify programs. We make them easier to write and understand.
