Remove HTML tags
HTML is used extensively on the Internet. But HTML tags themselves are sometimes not helpful when processing text.
We can remove HTML tags, and HTML comments, with Python and the re.sub function. The code does not handle every possible case, so use it with caution.
This program imports the re module for regular expression use. This code is not versatile or robust, but it does work on simple inputs.
Part 1: The string has some HTML tags, including nested tags. Closing tags are also included.
Part 2: We call re.sub with a special pattern as the first argument. Matches are replaced with an empty string (removed).
The question mark in the pattern makes the match non-greedy, so the entire string is not treated as one huge HTML tag.

```python
import re

# Part 1: this string contains HTML.
v = "<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"

# Part 2: replace HTML tags with an empty string.
result = re.sub("<.*?>", "", v)
print(result)
```

```
Sometimes, simpler is better, but not always.
```
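To see why the question mark in the pattern matters, a small comparison can help. This is only a sketch: the variable names `greedy` and `lazy` are illustrative and not part of the original program. Without the question mark, `.*` is greedy and runs from the first `<` to the last `>`, which here covers the whole string.

```python
import re

v = "<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"

# Greedy: ".*" extends to the LAST ">" in the string, so one match
# swallows everything between the first "<" and the final ">".
greedy = re.sub("<.*>", "", v)

# Non-greedy: ".*?" stops at the FIRST ">" after each "<",
# so each tag is matched and removed separately.
lazy = re.sub("<.*?>", "", v)

print(repr(greedy))
print(repr(lazy))
```

Because this string starts with a tag and ends with a tag, the greedy version removes all of it, while the non-greedy version keeps the text.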
HTML pages often contain comments. These can contain any text, including other comments and HTML tags. This code removes comments, but it does not handle all possible cases.

```python
import re

# This HTML string contains two comments.
v = """<p>Welcome to my <!-- awesome --> website<!-- bro --></p>"""

# Remove HTML comments.
result = re.sub("<!--.*?-->", "", v)
print(v)
print(result)
```

```
<p>Welcome to my <!-- awesome --> website<!-- bro --></p>
<p>Welcome to my  website</p>
```
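One case the comment pattern misses is a comment that spans more than one line, because `.` does not match a newline by default. The following sketch, using a made-up input string, shows how the `re.DOTALL` flag changes this. The names `without_flag` and `with_flag` are only for illustration.

```python
import re

# A comment that spans two lines (hypothetical input).
v = "<p>Hello <!-- first line\nsecond line --> world</p>"

# By default "." does not match "\n", so the pattern finds nothing.
without_flag = re.sub("<!--.*?-->", "", v)

# With re.DOTALL, "." also matches newlines and the comment is removed.
with_flag = re.sub("<!--.*?-->", "", v, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```

Passing `flags=re.DOTALL` is probably a sensible default when the input may contain multi-line comments.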
Web browsers use advanced parsers with error correction. This makes them more compatible with real web pages, but implementing that logic is challenging.
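When the regular expression is not enough, one alternative is to lean on an existing parser rather than write one. The sketch below uses `HTMLParser` from Python's standard `html.parser` module; this is a swapped-in technique, not the approach of the programs above. The class name `TextExtractor` and the function `strip_tags` are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; tags and comments go to other handlers,
        # which do nothing by default, so they are dropped.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# An attribute value containing ">" confuses the simple pattern.
tricky = '<a title="a > b">link</a> text'
print(repr(re.sub("<.*?>", "", tricky)))
print(repr(strip_tags(tricky)))
```

The regular expression stops at the `>` inside the attribute value and leaves part of the tag behind, while the parser understands the quoted attribute. Note that by default this parser also converts character references such as `&amp;` into the characters they stand for.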
With the re.sub function, we remove certain parts of strings. The regular expression argument can be used to match HTML tags, or HTML comments, accurately enough for simple inputs.
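The two patterns can be combined into one helper. This is a minimal sketch under some assumptions: the names `COMMENT`, `TAG` and `strip_markup` are hypothetical, and the `re.DOTALL` flag is added so that multi-line tags and comments are matched. Comments are removed first, since a `>` inside a comment would otherwise end the tag match early.

```python
import re

# Precompiled patterns (names are illustrative).
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html):
    # Step 1: remove comments, which may contain ">" characters.
    no_comments = COMMENT.sub("", html)
    # Step 2: remove the remaining tags.
    return TAG.sub("", no_comments)


v = "<p>Hi <!-- a > b --> there</p>"
print(repr(strip_markup(v)))

# Running only the tag pattern leaves part of the comment behind.
print(repr(TAG.sub("", v)))
```

As with the earlier programs, this still does not handle every possible case.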