In Python we access regular expressions through the "re" library. We call methods like re.match()
to test for patterns.
With match and search we evaluate regular expressions. More advanced methods like groupdict can process groups. Findall handles multiple matches—it returns a list.
Match example
This program uses a regular expression in a loop. It applies a for-loop over the elements in a list. In the loop body, we call re.match(). When a match is found, groups() returns a tuple containing the text captured by each group in the pattern.

    import re

    # Sample strings.
    values = ["dog dot", "data day", "no match"]

    # Loop.
    for element in values:
        # Match if 2 words starting with letter "d."
        m = re.match(r"(d\w+)\W(d\w+)", element)

        # See if success.
        if m:
            print(m.groups())

    ('dog', 'dot')
    ('data', 'day')
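The match example relies on two behaviors that may be worth seeing in isolation: re.match() returns None when nothing matches, and group(0) holds the whole match while group(1) and group(2) hold the captures. The following is a minimal sketch; the sample strings and the results list are chosen here for illustration.

```python
import re

# Hypothetical sample strings, chosen for illustration.
samples = ["dog dot", "no match"]

results = []
for text in samples:
    m = re.match(r"(d\w+)\W(d\w+)", text)
    if m is None:
        # re.match() returns None when the pattern does not match at the start.
        results.append(None)
    else:
        # group(0) is the whole match; group(1) and group(2) are the captures.
        results.append((m.group(0), m.group(1), m.group(2)))

print(results)
```

Testing the return value before calling groups() or group() avoids an AttributeError on None.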
Search
The re.search() method is different from match. Both apply a pattern. But search attempts this at all possible starting points in the string. Match just tries the first starting point.

    import re

    # Input.
    value = "voorheesville"

    m = re.search(r"(vi.*)", value)
    if m:
        # This is reached.
        print("search:", m.group(1))

    m = re.match(r"(vi.*)", value)
    if m:
        # This is not reached.
        print("match:", m.group(1))

    search: ville
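To make the difference in starting points concrete, this small sketch prints where search() found its match. It reuses the same input string; the variable names are illustrative only.

```python
import re

value = "voorheesville"

found = re.search(r"vi.*", value)
at_start = re.match(r"vi.*", value)

# search() scans forward and reports where the match begins.
print(found.group(0), found.start())

# match() only tries index 0, where the text is "vo", so it returns None.
print(at_start)
```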
Split
The re.split() method accepts a pattern argument. This pattern specifies the delimiter. With it, we can use any text that matches a pattern as the delimiter to separate text data.
Here we split the string on one or more non-digit characters. In the pattern, "\D" matches a single non-digit character and "+" means one or more.

    import re

    # Input string.
    value = "one 1 two 2 three 3"

    # Separate on one or more non-digit characters.
    result = re.split(r"\D+", value)

    # Print results.
    for element in result:
        if element != "":
            print(element)

    1
    2
    3
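The empty-string test in the split program is needed because the input begins with text that matches the delimiter. As a rough sketch (variable names are illustrative), this shows the raw result, one way to filter it, and the optional maxsplit argument of re.split().

```python
import re

value = "one 1 two 2 three 3"

# The string starts with a delimiter match ("one "), so split() returns
# an empty string as the first element.
raw = re.split(r"\D+", value)

# Drop the empty strings with a list comprehension.
cleaned = [part for part in raw if part]

# maxsplit limits how many splits are performed; the rest stays intact.
limited = re.split(r"\D+", value, maxsplit=2)

print(raw)
print(cleaned)
print(limited)
```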
Findall
The re.findall() method is similar to split(). It accepts a pattern that indicates which strings to return in a list. It is like split() but we specify matching parts, not delimiters.
Here we search the string for all words starting with the letter "d" or "p", followed by one or more word characters.

    import re

    # Input.
    value = "abc 123 def 456 dot map pat"

    # Find all words starting with d or p.
    result = re.findall(r"[dp]\w+", value)

    # Print result.
    print(result)

    ['def', 'dot', 'pat']
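One detail of the findall pattern: "[dp]\w+" can also match in the middle of a word, because nothing ties it to a word start. A possible refinement, shown here as a sketch with a made-up input string, is to add the "\b" word-boundary assertion.

```python
import re

# Hypothetical input where "d" and "p" appear inside words.
value = "update adapt dot"

# Without a boundary, the pattern matches inside "update" and "adapt".
loose = re.findall(r"[dp]\w+", value)

# With "\b", only words that actually start with d or p are returned.
anchored = re.findall(r"\b[dp]\w+", value)

print(loose)
print(anchored)
```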
Finditer
Unlike re.findall, which returns strings, finditer returns matches. For each match, we call methods like start() or end(). And we can access the value of the match with group().

    import re

    value = "123 456 7890"

    # Loop over all matches found.
    for m in re.finditer(r"\d+", value):
        print(m.group(0))
        print("start index:", m.start())

    123
    start index: 0
    456
    start index: 4
    7890
    start index: 8
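Since each match object knows its own position, the start and end indexes can be used to slice the original string. This sketch uses span(), which returns both indexes as a tuple; the spans list is an illustrative name.

```python
import re

value = "123 456 7890"

spans = []
for m in re.finditer(r"\d+", value):
    start, end = m.span()
    # Slicing the input with the span reproduces the matched text.
    assert value[start:end] == m.group(0)
    spans.append((m.group(0), start, end))

print(spans)
```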
Start, end
We can use special characters in an expression to match the start and end of a string. For the start, we use the character "^" and for the end, we use the "$" sign.
Here we loop over strings and call re.match. We detect all the strings that start or end with a digit character "\d." The match method always begins at the start of the string. So to test the end, we use ".*" to handle these initial characters.

    import re

    values = ["123", "4cat", "dog5", "6mouse"]

    for element in values:
        # See if string starts in digit.
        m = re.match(r"^\d", element)
        if m:
            print("START:", element)

        # See if string ends in digit.
        m = re.match(r".*\d$", element)
        if m:
            print(" END:", element)

    START: 123
     END: 123
    START: 4cat
     END: dog5
    START: 6mouse
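An alternative to the ".*" prefix is to call re.search(), which scans the string on its own. The sketch below is one way to express the same two tests with list comprehensions; the list names are illustrative.

```python
import re

values = ["123", "4cat", "dog5", "6mouse"]

# search() scans the string, so "$" can anchor the end without ".*".
ends_in_digit = [v for v in values if re.search(r"\d$", v)]

# match() already starts at the beginning, so "^" is optional there.
starts_in_digit = [v for v in values if re.match(r"\d", v)]

print(starts_in_digit)
print(ends_in_digit)
```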
Or, repeats
Here we match strings with three word characters (such as letters) or three dashes at their starts. And the final three characters must be digits. We use non-capturing groups with the "?:" syntax. The "|" character separates the two alternatives, and "{3}" repeats the preceding element three times.

    import re

    values = ["cat100", "---200", "xxxyyy", "jjj", "box4000", "tent500"]

    for v in values:
        # Require 3 word characters OR 3 dashes.
        # ... Also require 3 digits.
        m = re.match(r"(?:(?:\w{3})|(?:\-{3}))\d\d\d$", v)
        if m:
            print(" OK:", v)
        else:
            print("FAIL:", v)

     OK: cat100
     OK: ---200
    FAIL: xxxyyy
    FAIL: jjj
    FAIL: box4000
    FAIL: tent500
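Note that "\w" matches digits and underscores as well as letters, so the pattern above accepts more than three-letter prefixes. If only ASCII letters should qualify, a character class is one option. This is a sketch under that assumption; the "loose" pattern is a shortened equivalent of the one in the program, and the extra test strings are made up.

```python
import re

# Hypothetical inputs: the last two have non-letter prefixes.
values = ["cat100", "---200", "123456", "a_b789"]

# Same logic as the example pattern, written more compactly.
loose = re.compile(r"(?:\w{3}|-{3})\d{3}$")

# Stricter variant: the three leading characters must be ASCII letters.
strict = re.compile(r"(?:[A-Za-z]{3}|-{3})\d{3}$")

loose_ok = [v for v in values if loose.match(v)]
strict_ok = [v for v in values if strict.match(v)]

print(loose_ok)
print(strict_ok)
```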
Named groups
A regular expression can have named groups. This makes it easier to retrieve those groups after calling match(). But it makes the pattern more complex.
Here we get the first name with the string "first" and the group() method. We use "last" for the last name.

    import re

    # A string.
    name = "Roberta Alden"

    # Match with named groups.
    m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

    # Print groups using names as id.
    if m:
        print(m.group("first"))
        print(m.group("last"))

    Roberta
    Alden
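Named groups do not replace numbered access; they add to it. The sketch below shows a few equivalent ways to read the same groups from a match object. Subscript access (m["last"]) requires Python 3.6 or later.

```python
import re

name = "Roberta Alden"
m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

if m:
    # Named groups are still numbered, so both lookups agree.
    by_name = m.group("first")
    by_index = m.group(1)

    # Passing several names returns a tuple.
    both = m.group("first", "last")

    # Subscript access on a match object (Python 3.6 and later).
    last = m["last"]

    print(by_name, by_index, both, last)
```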
Groupdict
A regular expression with named groups can fill a dictionary. This is done with the groupdict() method. In the dictionary, each group name is a key.

    import re

    name = "Roberta Alden"

    # Match names.
    m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

    if m:
        # Get dict.
        d = m.groupdict()

        # Loop over dictionary with for-loop.
        for t in d:
            print("  key:", t)
            print("value:", d[t])

      key: first
    value: Roberta
      key: last
    value: Alden
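When a named group is optional and does not take part in the match, groupdict() stores None for it unless a default is supplied. The following sketch assumes a hypothetical pattern with an optional middle name, which is not part of the original program.

```python
import re

# Hypothetical pattern with an optional middle name, for illustration only.
pattern = re.compile(r"(?P<first>\w+)(?: (?P<middle>\w+))? (?P<last>\w+)")

# The middle group did not participate, so its value is None.
two = pattern.match("Roberta Alden").groupdict()

# The default argument replaces None for groups that did not participate.
two_default = pattern.match("Roberta Alden").groupdict(default="")

three = pattern.match("Roberta Jane Alden").groupdict(default="")

print(two)
print(two_default)
print(three)
```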
Comment
Sometimes a regular expression is confusing. A comment can be used to explain a complex part. One problem is that the comment syntax may be confusing too, so this should be considered.
A regex comment is written between "(?#" and ")". The comment text starts after the "#" character (just like in Python itself).

    import re

    data = "bird frog"

    # Use comments inside a regular expression.
    m = re.match(r"(?#Before part).+?(?#Separator)\W(?#End part)(.+)", data)
    if m:
        print(m.group(1))

    frog
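If inline comments make the pattern hard to read, the re.VERBOSE flag is another option. It ignores whitespace in the pattern and treats "#" as the start of a comment that runs to the end of the line. This sketch rewrites the same pattern that way.

```python
import re

data = "bird frog"

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment that runs to the end of the line.
pattern = re.compile(r"""
    .+?     # Before part, matched lazily
    \W      # Separator
    (.+)    # End part, captured as group 1
""", re.VERBOSE)

m = pattern.match(data)
if m:
    print(m.group(1))
```

A literal space or "#" in a verbose pattern must be escaped or placed in a character class.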
Not-followed-by
We use a negative lookahead, written "(?!...)", to ensure a value does not match. In this example, we match all the 3-digit strings except ones that are followed by a "dog" string.

    import re

    data = "100cat 200cat 300dog 400cat 500car"

    # Find all 3-digit strings except those followed by "dog" string.
    # ... Dogs are not allowed.
    m = re.findall(r"(?!\d\d\ddog)(\d\d\d)", data)
    print(m)

    ['100', '200', '400', '500']
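The same "not followed by" test can also be written with the lookahead placed after the digits, which some readers find easier to follow. The sketch below shows that form next to its positive counterpart, "(?=...)"; the variable names are illustrative.

```python
import re

data = "100cat 200cat 300dog 400cat 500car"

# Lookahead placed after the digits: 3 digits not followed by "dog".
not_dog = re.findall(r"\d{3}(?!dog)", data)

# A positive lookahead keeps only the digits that are followed by "dog".
only_dog = re.findall(r"\d{3}(?=dog)", data)

print(not_dog)
print(only_dog)
```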
Benchmark, search
Regular expressions often hinder performance in programs. I tested the in-operator on a string against the re.search method.
Version 1: This version uses the in-operator to find the letter "x" in the string.
Version 2: This version uses re.search (a regular expression method) to find the same letter.
Result: The in-operator was much faster than the re.search method. For searching with no pattern, prefer the in-operator.

    import time
    import re

    value = "max"

    if "x" in value:
        print(1)

    if re.search("x", value):
        print(2)

    print(time.time())

    # Version 1: in.
    c = 0
    i = 0
    while i < 1000000:
        if "x" in value:
            c += 1
        i += 1

    print(time.time())

    # Version 2: re.search.
    i = 0
    while i < 1000000:
        if re.search("x", value):
            c += 1
        i += 1

    print(time.time())

    1
    2
    1381081435.177
    1381081435.615 [in = 0.438 s]
    1381081437.224 [re.search = 1.609 s]
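For repeatable measurements, the standard timeit module may be a better fit than printing time.time() by hand. The sketch below times the same two operations plus a precompiled pattern, which is an addition not present in the original test. The run count is an arbitrary choice, and no particular timings are implied.

```python
import re
import timeit

value = "max"
pattern = re.compile("x")  # Precompiled once, outside the timed code.

# A small run count keeps the sketch quick; raise it for steadier numbers.
runs = 100_000

t_in = timeit.timeit(lambda: "x" in value, number=runs)
t_search = timeit.timeit(lambda: re.search("x", value), number=runs)
t_compiled = timeit.timeit(lambda: pattern.search(value), number=runs)

print(f"in operator:      {t_in:.3f} s")
print(f"re.search:        {t_search:.3f} s")
print(f"compiled search:  {t_compiled:.3f} s")
```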
Benchmark, match
In another test I rewrote a method that uses re.match to use if-statements and a for-loop. It became much faster.
Version 1: The stringmatch method returns early after finding an invalid length or an invalid start or end character. Otherwise we use a for-loop and test characters with Python statements.
Version 2: The stringmatch_re method calls re.match with the pattern "CA+T".

    import re
    import time

    def stringmatch(s):
        # Check for "CA+T" with if-statements and loop.
        if len(s) >= 3 and s[0] == 'C' and s[-1] == 'T':
            # Test every character between the first and the last.
            for v in range(1, len(s) - 1):
                if s[v] != 'A':
                    return False
            return True
        return False

    def stringmatch_re(s):
        # Check for "CA+T" with re.
        m = re.match(r"CA+T", s)
        if m:
            return True
        return False

    print(time.time())

    # Version 1: use string loop with if-statement.
    for i in range(0, 10000000):
        result = stringmatch("CT")
        result = stringmatch("CAAT")
        result = stringmatch("DOOOG")

    print(time.time())

    # Version 2: use re.match.
    for i in range(0, 10000000):
        result = stringmatch_re("CT")
        result = stringmatch_re("CAAT")
        result = stringmatch_re("DOOOG")

    print(time.time())

    1726672414.6579642
    1726672417.3540194 stringmatch = 2.69 s
    1726672428.6217597 stringmatch_re = 11.26 s
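When replacing a regular expression with hand-written logic, it helps to confirm that both versions agree before comparing their speed. The sketch below is one way to do that. It uses re.fullmatch() rather than re.match() so that the whole string must fit "CA+T", which is an assumption about the intended behavior; the sample strings are made up.

```python
import re

PATTERN = re.compile(r"CA+T")

def stringmatch(s):
    # Hand-written check: "C", then one or more "A", then "T".
    if len(s) >= 3 and s[0] == "C" and s[-1] == "T":
        return all(ch == "A" for ch in s[1:-1])
    return False

def stringmatch_re(s):
    # fullmatch() requires the entire string to match the pattern.
    return PATTERN.fullmatch(s) is not None

# Hypothetical samples, including edge cases for both versions.
samples = ["CT", "CAT", "CAAT", "CAXT", "DOOOG", "CAATS", ""]

agree = all(stringmatch(s) == stringmatch_re(s) for s in samples)
print(agree)
```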
Summary
A regular expression is often hard to write correctly. But when finished, it is shorter and overall simpler to maintain than the equivalent hand-written logic. It describes a specific type of logic.