Get Paragraph From HTML
Use the re.match method to get the text of paragraphs from HTML. Print the contents of the paragraph as a string.
This page was last reviewed on Nov 8, 2023.
Paragraph, HTML. It is not possible to completely parse HTML with Python regular expressions, but some parts can be extracted. With re.match and group() we can find text within tags.
When using re.match, the pattern is applied only at the start of the string, so here we write a pattern that spans the entire HTML string. Within that pattern we specify the tags that must surround the result.
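A small sketch of how re.match anchors its pattern may help here. It uses only the standard re module; the sample string and the variable names are made up for illustration. Because re.match tries the pattern only at the beginning of the string, a leading ".*" is what lets the pattern reach a tag further in.

```python
import re

# Hypothetical sample input, not taken from a real page.
html = "<body><p>Hello</p></body>"

# re.match only tries the pattern at position 0, so this finds nothing:
# the string starts with "<body>", not "<p>".
no_prefix = re.match(r"<p>(.+?)</p>", html)

# A leading ".*" lets the pattern skip ahead to the p tag.
with_prefix = re.match(r".*<p>(.+?)</p>", html)

# re.search scans the whole string, so no prefix is needed.
searched = re.search(r"<p>(.+?)</p>", html)

print(no_prefix)             # None
print(with_prefix.group(1))  # Hello
print(searched.group(1))     # Hello
```

If the prefix and suffix are not needed for other reasons, re.search is probably the simpler choice.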
Example. To begin, we specify a string that contains HTML tags within the Python program. In a real program, this could be loaded from an external file.
Step 1 This is the HTML data we are trying to parse. We specify this as a raw string with "r" to simplify escaping some values.
Step 2 In getfirstparagraph() we invoke re.match. The pattern matches the start and the end of the string with metacharacters.
Step 3 We check the returned match value, and then access the first group (which is text from the first parenthesis grouping).
import re

def getfirstparagraph(html):
    # Step 2: call re.match.
    # Capture the characters within the p tags.
    # Match the start and end of the html string.
    # The leading ".*?" is lazy so the first p tag is found, not the last.
    m = re.match(r"^.*?<p>\s*(.+?)\s*</p>.*$", html)
    # Step 3: return first group item if a match exists.
    if m:
        return m.group(1)
    return ""

# Step 1: specify html string and call getfirstparagraph method.
html = r"<html><title>...</title><body><p>Result.</p></body></html>"
print(getfirstparagraph(html))
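When a document has more than one paragraph, the same capture group can be reused with re.findall. This is only a sketch under simple assumptions (lowercase tags, no attributes, no newlines inside a paragraph); get_all_paragraphs and the sample string are hypothetical names introduced for illustration.

```python
import re

def get_all_paragraphs(html):
    # Hypothetical helper: return the text of every <p>...</p> pair.
    # The lazy (.+?) stops at the nearest closing tag.
    return re.findall(r"<p>\s*(.+?)\s*</p>", html)

html = "<html><body><p>First.</p><p> Second. </p></body></html>"
print(get_all_paragraphs(html))  # ['First.', 'Second.']
```

With one capture group in the pattern, re.findall returns a list of the captured strings, or an empty list when nothing matches.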
Some notes. Accessing text within HTML using Python regular expressions is not ideal, but it works on simple, predictable markup. It can fail when the HTML has comments, mixed-case tags, attributes on the p tag, or newlines (by default the "." metacharacter does not match a newline).
Be careful not to apply this style of code to all HTML documents, only to a known set of them where it matches correctly.
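Some of these cases can be loosened with flags from the re module. The following is a sketch, not a general HTML parser: get_first_paragraph_loose is a hypothetical variant, and it still does nothing about HTML comments or nested markup.

```python
import re

def get_first_paragraph_loose(html):
    # Hypothetical variant, not the function from the example above.
    # IGNORECASE accepts <P> as well as <p>.
    # DOTALL lets "." match newlines, so the text may span lines.
    # "<p\b[^>]*>" also tolerates attributes such as class="x".
    m = re.search(r"<p\b[^>]*>\s*(.+?)\s*</p>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else ""

html = '<BODY>\n<P class="intro">\nFirst line.\nSecond line.\n</P>\n</BODY>'
print(repr(get_first_paragraph_loose(html)))  # 'First line.\nSecond line.'
```

The "\b" after "p" keeps the pattern from matching other tags that merely start with the same letter, such as pre.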
Summary. HTML is everywhere in the modern world, and Python is often used to extract text from external documents. With this logic, we can fetch at least some paragraph text.
This page was last updated on Nov 8, 2023 (new).
© 2007-2023 Sam Allen.