String Remove HTML Tags
Updated Dec 24, 2024
Dot Net Perls
HTML tags. HTML is the universal language of web page markup. But when processing files, it often helps to remove tags and deal directly with text.
A full parser can handle nearly any HTML, even invalid HTML, but it is complex. With a simple replaceAll call, we can strip some HTML. This approach is limited but effective.
This program contains two methods. Both work on trivial HTML sources. On comments and unusual markup, they may (and often will) fail.
Info: stripHtmlRegex uses replaceAll. The first argument is a regular expression, and the second is the replacement.
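The question mark in the pattern matters. As a minimal sketch (the class and method names here are illustrative and not part of the original program), the following compares the reluctant pattern "<.*?>" with a greedy "<.*>" on a short, made-up string. The reluctant form stops at the first closing bracket, so each tag is matched on its own. The greedy form runs to the last closing bracket and removes the text between tags as well.

```java
public class QuantifierDemo {
    // Reluctant quantifier: ".*?" stops at the first '>' it reaches,
    // so every tag is matched and removed separately.
    public static String stripReluctant(String source) {
        return source.replaceAll("<.*?>", "");
    }

    // Greedy quantifier: ".*" extends to the last '>' on the line,
    // so the text between the first and last tag is removed too.
    public static String stripGreedy(String source) {
        return source.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration.
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println(stripReluctant(html));          // bold and italic
        System.out.println("[" + stripGreedy(html) + "]"); // []
    }
}
```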
Next: The char array method, stripTagsCharArray, implements a simple imperative parser in a for-loop. It flips a boolean flag on each angle bracket and copies only the characters found outside tags.
Info: Both methods produce the same, correct output on the example string. We test them in main().
public class Program {

    public static String stripHtmlRegex(String source) {
        // Replace all tag characters with an empty string.
        return source.replaceAll("<.*?>", "");
    }

    public static String stripTagsCharArray(String source) {
        // Create char array to store our result.
        char[] array = new char[source.length()];
        int arrayIndex = 0;
        boolean inside = false;

        // Loop over characters and append when not inside a tag.
        for (int i = 0; i < source.length(); i++) {
            char let = source.charAt(i);
            if (let == '<') {
                inside = true;
                continue;
            }
            if (let == '>') {
                inside = false;
                continue;
            }
            if (!inside) {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        // ... Return written data.
        return new String(array, 0, arrayIndex);
    }

    public static void main(String[] args) {
        final String html = "<p id=x>Sometimes, <b>simpler</b> is better, "
                + "but <i>not</i> always.</p>";
        System.out.println(html);

        String test = stripHtmlRegex(html);
        System.out.println(test);

        String test2 = stripTagsCharArray(html);
        System.out.println(test2);
    }
}
<p id=x>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>
Sometimes, simpler is better, but not always.
Sometimes, simpler is better, but not always.
Not ideal. To be clear, these methods are not ideal. For example, neither method supports HTML markup nested within comments. They can corrupt the text of otherwise correct pages.
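To make the comment problem concrete, here is a small sketch that runs copies of the two methods on a made-up input where a comment contains markup. The class name and the sample string are illustrative assumptions. A real parser would discard the whole comment and return only "New text", but both methods leave fragments of the comment behind, and they do not even agree with each other.

```java
public class CommentFailureDemo {
    // Copy of the regex-based method from the program above.
    public static String stripHtmlRegex(String source) {
        return source.replaceAll("<.*?>", "");
    }

    // Copy of the char array method from the program above.
    public static String stripTagsCharArray(String source) {
        char[] array = new char[source.length()];
        int arrayIndex = 0;
        boolean inside = false;
        for (int i = 0; i < source.length(); i++) {
            char let = source.charAt(i);
            if (let == '<') {
                inside = true;
                continue;
            }
            if (let == '>') {
                inside = false;
                continue;
            }
            if (!inside) {
                array[arrayIndex++] = let;
            }
        }
        return new String(array, 0, arrayIndex);
    }

    public static void main(String[] args) {
        // Made-up input: an HTML comment that contains a tag pair.
        String html = "<!-- <b>old</b> -->New text";

        // The regex treats "<!-- <b>" as one tag, so "old" and "-->" leak out.
        System.out.println(stripHtmlRegex(html));     // old -->New text

        // The loop also drops the stray '>' that closed the comment.
        System.out.println(stripTagsCharArray(html)); // old --New text
    }
}
```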
For HTML, web browser developers create complex and optimized parsers. An HTML parser is more than one line of Java code. Many features are not supported with these methods.
And: Due to the complex, organic nature of the web, these methods can be used only on a limited subset of pages.
In software, we often prefer the simplest solution for our needs. In a situation where only simple HTML constructs are found, the first method with replaceAll is useful.
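If the replaceAll approach is used on many strings, one possible variation is to compile the pattern once and reuse it, since String.replaceAll is documented as equivalent to compiling the regular expression on each call. The sketch below is an assumption-laden variant, not the original method: the class name, the TAG constant, and the choice of the DOTALL flag are all illustrative. DOTALL lets the dot match line breaks, so a tag split across two lines is still removed. The trade-off is that a stray '<' in ordinary text can then swallow everything up to the next '>', even across lines.

```java
import java.util.regex.Pattern;

public class CompiledTagPattern {
    // Compiled once and reused for every call.
    // DOTALL makes '.' match line terminators as well.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    public static String stripTags(String source) {
        return TAG.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up input: the opening tag is split across two lines.
        String html = "<a\nhref=x>link</a> text";

        // Without DOTALL, the two-line opening tag is not matched and stays in place.
        System.out.println(html.replaceAll("<.*?>", ""));

        // With DOTALL, both tags are removed.
        System.out.println(stripTags(html)); // link text
    }
}
```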
Sam Allen is passionate about computer languages, and he maintains 100% of the material available on this website. He hopes it makes the world a nicer place.
© 2007-2025 Sam Allen