Paragraph, HTML

VB.NET programs often need to do simple HTML processing, such as finding paragraphs. In certain cases, this can be done with regular expressions.

With a regular expression, we can extract text from between paragraph "P" tags. And by using the Groups property, we can get this value as a String.

Example

As we begin, please notice that we import the System.Text.RegularExpressions namespace with the Imports keyword. Without it, the unqualified Regex and Match type names would not resolve, and the program would not compile.

Start We specify an HTML string and pass it to GetFirstParagraph. In real programs, we might read in a file with File.ReadAllText.

Next In GetFirstParagraph, we have some complex regular expression logic. We specify some Kleene closures to access data within paragraph tags.

Tip The star character, meaning zero or more repeats, is a Kleene closure. Here it matches the optional whitespace around the inner value. The inner value itself is captured by (.+?), a lazy match of one or more characters.

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim html As String = "<html><title>...</title><body><p>Result.</p></body></html>"
        Console.WriteLine(GetFirstParagraph(html))
    End Sub

    Function GetFirstParagraph(value As String) As String
        ' Use regular expression to match a paragraph.
        Dim match As Match = Regex.Match(value, "<p>\s*(.+?)\s*</p>")
        If match.Success Then
            Return match.Groups(1).Value
        Else
            Return ""
        End If
    End Function

End Module
Result.
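
The example passes a literal string, but the text mentions File.ReadAllText for real programs. Here is a minimal sketch of that variation. The file name sample-page.html and the temporary-folder location are assumptions made only so the sketch runs on its own. A real program would read an existing file instead.

```vbnet
Imports System
Imports System.IO
Imports System.Text.RegularExpressions

Module FileDemo

    ' Same logic as the example: return the first paragraph's inner text.
    Function GetFirstParagraph(value As String) As String
        Dim m As Match = Regex.Match(value, "<p>\s*(.+?)\s*</p>")
        Return If(m.Success, m.Groups(1).Value, "")
    End Function

    Sub Main()
        ' Hypothetical file, written here only so there is something to read.
        Dim fileName As String = Path.Combine(Path.GetTempPath(), "sample-page.html")
        File.WriteAllText(fileName, "<html><body><p>From a file.</p></body></html>")

        ' Read the whole file into one string, then extract as before.
        Dim html As String = File.ReadAllText(fileName)
        Console.WriteLine(GetFirstParagraph(html))

        ' Remove the temporary file.
        File.Delete(fileName)
    End Sub

End Module
```

File.ReadAllText loads the entire file into memory, which suits small pages like this one.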

Some notes

When using the Groups property on a Match result from Regex.Match, it is important to access element 1 for the first captured group. The collection is zero-based, but element 0 holds the entire match, so the parenthesized captures start at index 1.
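
To make the indexing concrete, here is a small sketch that prints the first two elements of Groups for the pattern used in the example. The module name GroupsDemo and the sample input are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupsDemo

    Sub Main()
        Dim m As Match = Regex.Match("<p>  Result.  </p>", "<p>\s*(.+?)\s*</p>")
        If m.Success Then
            ' Index 0 is the entire matched text, tags and whitespace included.
            Console.WriteLine("[" & m.Groups(0).Value & "]")
            ' Index 1 is the first parenthesized capture: the inner text.
            Console.WriteLine("[" & m.Groups(1).Value & "]")
            ' The count covers the whole match plus the one capture group.
            Console.WriteLine(m.Groups.Count)
        End If
    End Sub

End Module
' Output:
' [<p>  Result.  </p>]
' [Result.]
' 2
```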

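Accessing inner text values from within HTML strings can be difficult. Regular expressions are not the best solution in all cases: the pattern above matches only a lowercase "p" tag with no attributes and no line breaks inside the paragraph. They can still work on simple HTML pages.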
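
As one illustration of where the simple pattern falls short, the sketch below uses a paragraph with an uppercase tag, an attribute, and a line break. The function GetFirstParagraphLoose is a hypothetical variant, not part of the original example. Its more tolerant pattern is still not a real HTML parser and will fail on nested or malformed markup.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TolerantDemo

    ' Hypothetical variant that accepts attributes, uppercase tags,
    ' and line breaks inside the paragraph.
    Function GetFirstParagraphLoose(value As String) As String
        ' <p\b[^>]*> allows attributes such as class="intro".
        ' IgnoreCase matches <P>; Singleline lets the dot match newlines.
        Dim m As Match = Regex.Match(value, "<p\b[^>]*>\s*(.+?)\s*</p>",
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        Return If(m.Success, m.Groups(1).Value, "")
    End Function

    Sub Main()
        Dim html As String = "<body><P class=""intro"">First" &
            Environment.NewLine & "line.</P></body>"

        ' The strict pattern from the example finds nothing in this input.
        Console.WriteLine(Regex.Match(html, "<p>\s*(.+?)\s*</p>").Success)

        ' The looser pattern returns both lines of the paragraph.
        Console.WriteLine(GetFirstParagraphLoose(html))
    End Sub

End Module
```

The \b after the tag name keeps the pattern from matching other tags that merely start with the letter p, such as "pre".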