Python HTML: HTMLParser, Read Markup

This Python article introduces the html.parser module. It uses HTMLParser and implements a simple class.
HTML. In HTML, we find tags, attributes and data. We could write custom methods to parse these. But in Python we can instead use the HTMLParser class from the html.parser module. We derive a class from HTMLParser to add more features.
Example. This example implements a class that derives from HTMLParser. It uses the inheritance syntax. This TagParser class is not fully effective on some HTML documents. It works on some tags, like title tags, but not with nested elements.

Methods: In the class, we specify 2 methods: handle_starttag and handle_data. Other methods can be specified.

Here: We just set a field "tag" to the name of the current start tag in handle_starttag.

And: Then when we encounter data, in handle_data, we use the previous tag name to help identify that data.

Caution: This approach is not ideal, but if you are just searching for simple tags, like title or h1 elements, it works.

Python program that uses html.parser

    from html.parser import HTMLParser

    # A class that inherits from HTMLParser.
    # ... It implements two methods.
    class TagParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # Set "tag" field to the name of the opened tag.
            self.tag = tag

        def handle_data(self, data):
            # Print data within currently-open tag.
            print(self.tag + ":", data)

    parser = TagParser()
    parser.feed("<h1>Python</h1>" +
                "<p>Is cool.</p>")

Output

    h1: Python
    p: Is cool.
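Nested elements. The TagParser class above remembers only the last start tag, so text that follows a nested element is labeled with the wrong tag. One possible fix, shown here as a minimal sketch, is to keep a stack of open tags and pop it in handle_endtag. The StackTagParser class and its "stack" and "found" fields are names invented for this illustration. The sketch assumes well-formed markup with no void elements (such as br), which never receive an end tag.

```python
from html.parser import HTMLParser

# Hypothetical class for illustration: tracks open tags with a stack.
class StackTagParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # Names of the currently open tags.
        self.found = []   # (tag, data) pairs collected so far.

    def handle_starttag(self, tag, attrs):
        # Remember the newly opened tag.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Label the data with the innermost open tag, if there is one.
        current = self.stack[-1] if self.stack else None
        self.found.append((current, data))

parser = StackTagParser()
parser.feed("<p>Is <b>very</b> cool.</p>")
parser.close()
print(parser.found)
```

Here the trailing text " cool." is paired with p again once the b element has closed. With the single "tag" field it would still be labeled b.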
Feed. We call the feed method on the HTMLParser instance. With feed, we "feed" string data to the parser. It then internally reads the characters in the string. And it calls your specified methods, if the required elements are found.

Tip: You can specify any Python statements within your class that derives from HTMLParser.

And: This makes it possible to develop a custom HTML parser. It erases the need to handle tedious HTML syntax in custom code.
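Chunks. The feed method can be called more than once, so a document does not need to arrive as a single string. The following is a small sketch of that idea. The TextCollector class and its "pieces" field are invented for this example. Text may reach handle_data in several pieces, so the sketch joins them at the end instead of relying on one call per element.

```python
from html.parser import HTMLParser

# Hypothetical class for illustration: collects text instead of printing it.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pieces = []  # Text fragments in the order they were received.

    def handle_data(self, data):
        # Any Python statements can go here; we append to a list.
        self.pieces.append(data)

collector = TextCollector()

# Feed the markup in several chunks, as if it were read from a stream.
for chunk in ["<h1>Py", "thon</h1><p>Is ", "cool.</p>"]:
    collector.feed(chunk)

# Tell the parser that no more data is coming, so buffered text is flushed.
collector.close()

text = "".join(collector.pieces)
print(text)
```

Calling close after the last chunk matters here: the parser may hold back trailing text until it knows the input has ended.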

Methods. There are many methods on HTMLParser that you can specify. Attributes are received as attrs in the handle_starttag method: this is a list of (name, value) tuples. More detailed examples for attributes (and comments) are available in the html.parser documentation on python.org.

Tip: You can loop over the attributes (attrs) list like any other list. The for-loop is ideal.
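Attributes example. As a short sketch of the loop described above, this program collects the href values of anchor tags. The LinkParser class and its "links" field are invented for this illustration. Each entry in attrs is a (name, value) tuple, so it can be unpacked directly in the for-loop.

```python
from html.parser import HTMLParser

# Hypothetical class for illustration: gathers href values from a tags.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything except anchor tags.
        if tag != "a":
            return
        # Each attribute is a (name, value) tuple.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

parser = LinkParser()
parser.feed('<p><a href="https://www.python.org/" title="Python">Python</a></p>')
parser.close()
print(parser.links)
```

The title attribute is visited by the loop too, but it is skipped because only href is stored.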

Summary. HTML markup is far from trivial to parse. HTML is common. And for this reason many edge cases have emerged: few parsers can handle them all. Using a prebuilt class, like HTMLParser, makes building a special parser in Python easier.
© 2007-2019 Sam Allen. Every person is special and unique. Send bug reports to info@dotnetperls.com.