Dot Net Perls

Remove HTML Tags From String - C#

by Sam Allen

Problem

You want to remove all HTML tags from a string. This is useful for displaying HTML content as plain text and for stripping formatting such as bold and italics. Here we compare three methods so you can select the fastest and most appropriate one.

Input:  <p>The <b>dog</b> is <i>cute</i>.</p>
Output: The dog is cute.

Solution: string manipulation in C#

First, my performance research suggests that regular expressions in C# are often slower than simple character loops. I wrote an algorithm that uses a combination of StringBuilder and character tests to strip HTML tags. [C# - StringBuilder Secrets - dotnetperls.com]

Example: methods to remove HTML from strings

Here are three C# methods that receive a string that has HTML tags and return a string that has no tags. HTML tags start with < and end with >.

How it works. Method 1 shown below is a simple parser that walks through the string and records whether or not it is inside a tag. If it is inside a tag, it doesn't save the character.

using System;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string s = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";
        Console.WriteLine(StripTags(s));
        Console.WriteLine(StripTagsRegex(s));
        Console.WriteLine(StripTagsRegex2(s));
        Console.ReadLine();
    }

    /// <summary>
    /// 1.
    /// Remove all HTML tags from string.
    /// </summary>
    static string StripTags(string text)
    {
        StringBuilder b = new StringBuilder(text.Length);
        bool inside = false;
        for (int i = 0; i < text.Length; i++)
        {
            char let = text[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                b.Append(let);
            }
        }
        return b.ToString();
    }

    /// <summary>
    /// 2.
    /// Common method used to remove all HTML tags from string with Regex.
    /// </summary>
    static string StripTagsRegex(string text)
    {
        return Regex.Replace(text, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Required for method 3.
    /// </summary>
    static Regex _reg = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// 3.
    /// Optimized Regex method to remove all HTML tags from string with Regex.
    /// </summary>
    static string StripTagsRegex2(string text)
    {
        return _reg.Replace(text, string.Empty);
    }
}

The first method goes through each character and records whether it is inside a tag. If it is, the character is not kept. The other two methods use a regex pattern to find tags and replace each match with an empty string.

Information: output from the examples

The three methods produce the same output on simple, well-formed HTML like the example. They differ on malformed input: the first method strips everything that follows a <, even if no > ever appears, while the two Regex methods strip a tag only when they find its closing >. Here's the output of the above console program.
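
The difference on unusual input can be seen with a small sketch. The class name MalformedDemo and the sample strings are illustrative assumptions, not part of the original program; the two methods are copies of methods 1 and 2. The second sample relies on the fact that . does not match a newline unless RegexOptions.Singleline is set, so a tag split across lines is left alone by the Regex version.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MalformedDemo
{
    // Copy of method 1: drops every character between '<' and the next '>'.
    public static string StripTags(string text)
    {
        StringBuilder b = new StringBuilder(text.Length);
        bool inside = false;
        foreach (char c in text)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) b.Append(c);
        }
        return b.ToString();
    }

    // Copy of method 2: removes only spans that contain both '<' and '>'.
    public static string StripTagsRegex(string text)
    {
        return Regex.Replace(text, "<.*?>", string.Empty);
    }

    static void Main()
    {
        // Hypothetical input with a '<' that never closes.
        const string unclosed = "1 < 2 is true";
        Console.WriteLine("[" + StripTags(unclosed) + "]");      // [1 ]
        Console.WriteLine("[" + StripTagsRegex(unclosed) + "]"); // [1 < 2 is true]

        // Hypothetical tag that spans two lines.
        const string multiLine = "<a\nhref='x'>link</a>";
        Console.WriteLine("[" + StripTags(multiLine) + "]");     // [link]
        // The Regex version removes only the closing </a> tag here.
        Console.WriteLine("[" + StripTagsRegex(multiLine) + "]");
    }
}
```

Which behavior is preferable depends on the input: the character loop never leaves a stray tag behind, but it also discards ordinary text after a literal < sign.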

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.

Information: credit for these three methods

The first method here I wrote from scratch. It builds on my work with valid XHTML and string parsing routines. The first Regex-based method I adapted from another source. The third method uses tricks from my Regex articles to improve performance. [Strip all HTML tags - csharp-online.net]

Information: benchmarks of the three methods

The first method was clearly the fastest in my benchmark. My optimized Regex method 3 is about 70% faster than the Regex method 2. My benchmark consisted of 1 million iterations for each.

Interpreting the results. In a release build of .NET 3.5, the approach that doesn't use Regex performs better than all the Regex approaches. Using RegexOptions.Compiled and a separate Regex object helps make #3 much faster than #2.
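
A rough way to reproduce this kind of comparison is sketched below. This is not the original benchmark harness, which is not shown in the article; the class name StripBenchmark and the helper Time are assumptions made for illustration, and the numbers will vary by machine and .NET version.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    // Same pattern and options as method 3.
    static readonly Regex Compiled = new Regex("<.*?>", RegexOptions.Compiled);

    // Copy of method 1.
    public static string StripTags(string text)
    {
        StringBuilder b = new StringBuilder(text.Length);
        bool inside = false;
        foreach (char c in text)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) b.Append(c);
        }
        return b.ToString();
    }

    // Hypothetical helper: runs the method repeatedly and returns elapsed milliseconds.
    public static long Time(Func<string, string> method, string input, int iterations)
    {
        method(input); // warm-up call so one-time setup cost is not timed
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            method(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string s = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";
        const int n = 1000000; // 1 million iterations, as in the article

        Console.WriteLine("1. StripTags:       {0} ms", Time(StripTags, s, n));
        Console.WriteLine("2. Regex.Replace:   {0} ms",
            Time(t => Regex.Replace(t, "<.*?>", string.Empty), s, n));
        Console.WriteLine("3. Compiled Regex:  {0} ms",
            Time(t => Compiled.Replace(t, string.Empty), s, n));
    }
}
```

Run it as a release build outside the debugger, since debug builds can distort the relative timings.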

RegexOptions.Compiled has some drawbacks, however. It can make startup about 10x slower in some cases, because the pattern is compiled when the Regex is constructed. More material is available on making Regexes simpler and faster to run. [C# - Regex Improvement - dotnetperls.com]
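
The startup cost can be observed with a sketch like the following. The class name RegexStartupDemo and the helper TimeFirstUse are assumptions for illustration; the measured ratio depends on the runtime and the pattern, so treat the output as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class RegexStartupDemo
{
    // Hypothetical helper: times constructing a Regex plus its first replacement.
    public static TimeSpan TimeFirstUse(RegexOptions options)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Regex r = new Regex("<.*?>", options);
        r.Replace("<b>x</b>", string.Empty);
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Warm up the Regex engine with an unrelated pattern so that
        // general one-time loading cost is not charged to the first measurement.
        new Regex("warm").IsMatch("warm");

        Console.WriteLine("Interpreted: {0}", TimeFirstUse(RegexOptions.None));
        Console.WriteLine("Compiled:    {0}", TimeFirstUse(RegexOptions.Compiled));
    }
}
```

If the compiled version is only used a handful of times, this one-time cost may outweigh the per-call savings.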

Summary: removing HTML markup from strings

You can strip HTML tags from strings with any of these methods. They all give the same results on most input, but the first method, which avoids Regex, was much faster than the more common Regex approaches in my benchmark.

© 2008 Sam Allen. All rights reserved.