It is not possible to completely parse HTML with Python regular expressions, but some parts can be extracted. With re.match and group() we can find text within tags.
When using re.match, the pattern is applied from the start of the string, so here we write a pattern that matches the entire HTML string. Within that pattern, we specify the tags that must surround the result.
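As a minimal sketch of why the pattern has to account for the text before the tags, the snippet below compares re.match with re.search on a small sample string. The sample string and variable names are assumptions made for illustration only.

```python
import re

# Hypothetical sample input, not taken from a real page.
html = "<html><body><p>Result.</p></body></html>"

# re.match only tries the pattern at position 0. The string starts
# with "<html>", not "<p>", so this attempt fails and returns None.
bare = re.match(r"<p>(.+?)</p>", html)
print(bare)

# Allowing for the leading text lets the same capture group succeed.
with_prefix = re.match(r".*?<p>(.+?)</p>", html)
print(with_prefix.group(1))

# re.search scans the whole string, so no leading ".*?" is needed.
searched = re.search(r"<p>(.+?)</p>", html)
print(searched.group(1))
```

Either form can work; re.search is often the shorter choice when only one fragment of the document matters.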
To begin, we specify a string that contains HTML tags within the Python program. In a real program, this could be loaded from an external file.
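Step 1: We specify the HTML string. We use a raw string with "r" to simplify escaping some values.
Step 2: In get_first_paragraph() we invoke re.match. The pattern matches the start and the end of the string with metacharacters.
Step 3: We return the first group if the match succeeded.

```python
import re

def get_first_paragraph(html):
    # Step 2: call re.match.
    # Capture the characters within the p tags.
    # Match the start (^) and end ($) of the html string.
    # The lazy .*? stops at the first <p>; a greedy .* would skip to the last one.
    # re.DOTALL lets "." match newlines, so multi-line HTML also matches.
    m = re.match(r"^.*?<p>\s*(.+?)\s*</p>.*$", html, re.DOTALL)
    # Step 3: return the first group if the match succeeded.
    if m:
        return m.group(1)
    return ""

# Step 1: specify the html string and call the get_first_paragraph function.
html = r"<html><title>...</title><body><p>Result.</p></body></html>"
print(get_first_paragraph(html))
```

```
Result.
```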
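One detail worth checking in a pattern of this shape is how the leading .* behaves when the document has more than one paragraph. The sketch below uses a hypothetical two-paragraph input (an assumption for illustration) to compare a greedy prefix with a lazy one.

```python
import re

# Hypothetical input with two paragraphs.
html = "<body><p>First.</p><p>Second.</p></body>"

# Greedy prefix: ".*" consumes as much as it can, so the <p> that
# ends up matching is the last one in the string.
greedy = re.match(r"^.*<p>\s*(.+?)\s*</p>.*$", html)

# Lazy prefix: ".*?" consumes as little as it can, so the match
# settles on the first <p>.
lazy = re.match(r"^.*?<p>\s*(.+?)\s*</p>.*$", html)

print(greedy.group(1))  # Second.
print(lazy.group(1))    # First.
```

For a function meant to return the first paragraph, the lazy prefix is the form that matches the intent.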
Extracting text from HTML with Python regular expressions is not ideal, but it usually works for simple, predictable markup. There can be problems with comments, mixed-case HTML tags, or tags that carry attributes.
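These limits can be seen in a short sketch. The helper name get_first_paragraph_loose and the sample strings are hypothetical; the flags re.IGNORECASE and re.DOTALL are standard re module flags. The looser pattern tolerates mixed-case tags and attributes, but it still cannot tell that a paragraph sits inside a comment.

```python
import re

def get_first_paragraph_loose(html):
    # Hypothetical helper, not part of the original program.
    # <p\b[^>]*> accepts attributes on the tag, and \b keeps
    # other tags such as <pre> from matching.
    # re.IGNORECASE accepts <P> as well as <p>.
    # re.DOTALL lets "." span line breaks inside the paragraph.
    m = re.search(r"<p\b[^>]*>\s*(.+?)\s*</p>", html,
                  re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else ""

# Mixed-case tag with an attribute: handled.
print(get_first_paragraph_loose('<BODY><P class="intro">Result.</P></BODY>'))

# Commented-out paragraph: the regex cannot see the comment
# boundaries, so it returns the text inside the comment.
print(get_first_paragraph_loose("<!-- <p>Old.</p> --><p>New.</p>"))
```

The second call prints Old. rather than New., which is the kind of case where an HTML parser is the safer tool.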
HTML is everywhere in the modern world, and Python is often used to extract text from external documents. With this approach, we can fetch at least some paragraph text.