Remove HTML tags

A C# string may contain HTML tags, and we want to remove those tags. This is useful for displaying HTML in plain text and stripping formatting like bold and italics.
A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be best in many cases: always test methods.
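As a rough illustration of where a Regex can fall short, the sketch below runs the pattern "<.*?>" on two awkward inputs: a tag that spans a line break, and an attribute value that contains ">". The class name RegexLimits and the sample strings are hypothetical, chosen only for this demonstration. By default "." does not match a newline in .NET, so RegexOptions.Singleline is tried as one possible adjustment.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexLimits
{
    // Lazy tag pattern: "<", as few characters as possible, then ">".
    public static string Strip(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern, but Singleline lets "." match '\n' as well.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}

class Program
{
    static void Main()
    {
        // A tag broken across two lines: the default pattern skips it.
        string multiLine = "<a\nhref=\"x\">link</a>";
        Console.WriteLine(RegexLimits.Strip(multiLine) == "<a\nhref=\"x\">link"); // True
        Console.WriteLine(RegexLimits.StripSingleline(multiLine));                // link

        // A ">" inside an attribute value ends the match too early.
        string quoted = "<a title=\"a>b\">link</a>";
        Console.WriteLine(RegexLimits.Strip(quoted)); // b">link
    }
}
```

The second case also affects a simple character loop that treats every ">" as the end of a tag, so it is a limit of the approach rather than of Regex alone.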
A for-loop can also be used to validate HTML to see if it is mostly correct (whether its tags have correct syntax).

Here is a class that tests 3 ways of removing HTML tags while keeping the text between them. The methods process an HTML string and return new strings that have no HTML tags.
StripTagsRegex uses a static call to Regex.Replace, so the expression is not compiled with RegexOptions.Compiled.
StripTagsRegexCompiled uses a Regex instance created with RegexOptions.Compiled. Its pattern removes every sequence that starts with "<", continues with any number of characters, and ends at the next ">".
StripTagsCharArray is an optimized, iterative method. In most benchmarks, this method is faster than Regex.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string html = "<p>Hello <b>world</b>!</p>";
            Console.WriteLine(StripTagsRegex(html));
            Console.WriteLine(StripTagsRegexCompiled(html));
            Console.WriteLine(StripTagsCharArray(html));
        }

        /// <summary>
        /// Remove HTML from string with Regex.
        /// </summary>
        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        /// <summary>
        /// Remove HTML tags from string using char array.
        /// </summary>
        public static string StripTagsCharArray(string source)
        {
            // The result can never be longer than the input.
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;
            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }

Hello world!
Hello world!
Hello world!
Regular expressions are usually not the fastest way to process text. Char arrays and the string constructor can be used instead, and this often performs better.
The benchmark strips the tags from the string returned by GetHtml().
Version 1 uses the static Regex.Replace call.
Version 2 uses the compiled Regex.
Version 3 is the char-array method that loops and tests characters, appending to a buffer as it goes along.
The char array method was considerably faster. In 2020, using a char array is still a good choice.

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            string html = GetHtml();
            const int m = 10000;

            Stopwatch s1 = Stopwatch.StartNew();
            // Version 1: use Regex.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsRegex(html) == null)
                {
                    return;
                }
            }
            s1.Stop();

            Stopwatch s2 = Stopwatch.StartNew();
            // Version 2: use Regex Compiled.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsRegexCompiled(html) == null)
                {
                    return;
                }
            }
            s2.Stop();

            Stopwatch s3 = Stopwatch.StartNew();
            // Version 3: use char array.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsCharArray(html) == null)
                {
                    return;
                }
            }
            s3.Stop();

            Console.WriteLine(s1.ElapsedMilliseconds);
            Console.WriteLine(s2.ElapsedMilliseconds);
            Console.WriteLine(s3.ElapsedMilliseconds);
        }

        static string GetHtml()
        {
            // Build a longer test document by repeating one paragraph 100 times.
            var result = Enumerable.Repeat("<p><b>Hello, friend,</b> how are you?</p>", 100);
            return string.Join("", result);
        }

        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;
            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }

1086 ms    StripTagsRegex
 694 ms    StripTagsRegexCompiled
  54 ms    StripTagsCharArray
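Timings only mean something if the methods being compared return the same text. The sketch below is one possible sanity check to run before a benchmark; the class StripCheck and the helper SameResult are hypothetical names, and the second input is a made-up edge case (a stray ">" outside any tag) where the two approaches are expected to disagree.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripCheck
{
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Hypothetical helper: true when both methods produce the same text.
    public static bool SameResult(string html)
    {
        return StripTagsRegex(html) == StripTagsCharArray(html);
    }
}

class Program
{
    static void Main()
    {
        // Well-formed markup: both methods agree.
        Console.WriteLine(StripCheck.SameResult("<p><b>Hello, friend,</b> how are you?</p>")); // True
        // A lone ">" is kept by the Regex but dropped by the loop.
        Console.WriteLine(StripCheck.SameResult("a > b")); // False
    }
}
```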
In XHTML, some elements have no separate closing tag, and instead use the "/" at the end of the first tag. The methods tested on this page correctly handle self-closing tags.
Here are some tag forms that are supported. Invalid tags may not work in the Regex methods.

<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
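To check the tag forms listed above, a small test can wrap each one in surrounding text and confirm that only the text survives. This is a minimal sketch that uses the same "<.*?>" pattern as the Regex methods; the class name TagForms and the "a"/"b" wrapper text are assumptions made for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagForms
{
    // The tag forms from the list above.
    public static readonly string[] Samples =
    {
        "<img src=\"\" />", "<img src=\"\"/>", "<br />", "<br/>", "< div >", "<!-- -->"
    };

    public static string Strip(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }
}

class Program
{
    static void Main()
    {
        foreach (string sample in TagForms.Samples)
        {
            // Each line should end with "ab" if the tag was removed completely.
            string result = TagForms.Strip("a" + sample + "b");
            Console.WriteLine(sample + " -> " + result);
        }
    }
}
```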
Here is a way to validate XHTML using methods similar to StripTagsCharArray. We check that the tag bracket characters alternate, so that each "<" is closed by a ">" before the next tag begins.
Alternatively, we can run the Regex methods and then look for tag bracket characters that are still present.

    using System;

    class Program
    {
        static void Main()
        {
            // Test the IsValid method.
            Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
            Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
            Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
            Console.WriteLine(HtmlUtil.IsValid("<<>>"));
            Console.WriteLine(HtmlUtil.IsValid(""));
        }
    }

    static class HtmlUtil
    {
        enum TagType
        {
            SmallerThan, // "<"
            GreaterThan  // ">"
        }

        public static bool IsValid(string html)
        {
            TagType expected = TagType.SmallerThan; // Must start with "<"
            for (int i = 0; i < html.Length; i++) // Loop
            {
                bool smallerThan = html[i] == '<';
                bool greaterThan = html[i] == '>';
                if (!smallerThan && !greaterThan) // Common case
                {
                    continue;
                }
                if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue
                {
                    expected = TagType.GreaterThan;
                    continue;
                }
                if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue
                {
                    expected = TagType.SmallerThan;
                    continue;
                }
                return false; // Disallow
            }
            return expected == TagType.SmallerThan; // Must expect "<"
        }
    }

True
False
True
False
True
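The Regex-based check mentioned above can be sketched as follows: strip the tags, then report whether any "<" or ">" is left over. The class HtmlCheck and the method HasLeftoverBrackets are hypothetical names, and the inputs are the same ones passed to IsValid.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlCheck
{
    static readonly char[] Brackets = { '<', '>' };

    // Hypothetical helper: true when brackets remain after stripping tags,
    // which suggests the markup had unbalanced tag characters.
    public static bool HasLeftoverBrackets(string html)
    {
        string stripped = Regex.Replace(html, "<.*?>", string.Empty);
        return stripped.IndexOfAny(Brackets) >= 0;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(HtmlCheck.HasLeftoverBrackets("<html><head></head></html>")); // False
        Console.WriteLine(HtmlCheck.HasLeftoverBrackets("<html<head<head<html"));       // True
        Console.WriteLine(HtmlCheck.HasLeftoverBrackets("<<>>"));                       // True
        Console.WriteLine(HtmlCheck.HasLeftoverBrackets(""));                           // False
    }
}
```

Note that the result is inverted compared with IsValid: True here means a problem was found.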
The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup.
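One possible workaround, shown here only as a sketch, is to remove whole comments in a first pass and strip the remaining tags afterwards. The pattern "<!--.*?-->" with RegexOptions.Singleline, the class name CommentStrip, and the sample input are assumptions for this example and are not part of the methods tested above.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStrip
{
    // The plain tag pattern used by the Regex methods.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Assumed pre-pass: delete comments (including multi-line ones) first.
    public static string StripCommentsThenTags(string source)
    {
        string withoutComments = Regex.Replace(
            source, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return StripTags(withoutComments);
    }
}

class Program
{
    static void Main()
    {
        string html = "<!-- <b>bold</b> -->text";
        // The tag pattern alone leaves comment debris behind.
        Console.WriteLine(CommentStrip.StripTags(html));             // bold -->text
        // Removing the comment first leaves only the visible text.
        Console.WriteLine(CommentStrip.StripCommentsThenTags(html)); // text
    }
}
```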
Char arrays

One method (StripTagsCharArray) uses char arrays. It is much faster than the other 2 methods. It uses an algorithm for parsing the HTML.
The method loops over the characters and sets a Boolean depending on whether it is inside a tag block.
It adds each char to the array if it is not part of a tag. It uses char arrays and the string constructor.
Char arrays are faster than using StringBuilder. But StringBuilder can be used with similar results.

Several methods can strip HTML tags from strings or files. These methods have the same results on the input. But the iterative method is faster in the test here.
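For comparison, the StringBuilder alternative mentioned above might look like the sketch below. It follows the same inside/outside logic as StripTagsCharArray; the class name StripBuilder and the method name StripTagsStringBuilder are hypothetical, and no timing claim is made for this version.

```csharp
using System;
using System.Text;

static class StripBuilder
{
    public static string StripTagsStringBuilder(string source)
    {
        // Pre-size the builder: the output is never longer than the input.
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<')
            {
                inside = true;
            }
            else if (c == '>')
            {
                inside = false;
            }
            else if (!inside)
            {
                // Keep only characters that are outside a tag.
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(StripBuilder.StripTagsStringBuilder("<p>Hello <b>world</b>!</p>")); // Hello world!
    }
}
```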