Remove HTML
Sometimes a string has HTML characters in it that are not desired; they may have been entered by mistake. It is straightforward to remove common HTML tags with a Rust function.
By checking each character for the angle brackets, we can detect when markup starts and ends. Then we can avoid adding those characters to a string copy.
To begin, we introduce a strip_html function that receives a str reference and returns a new String. The result of this function is a string with no HTML markup.
Step 1: We use a for-loop over the chars in the string. chars() returns an iterator of the individual characters.
Step 2: We detect where markup starts and ends by testing for the angle brackets, and skip over the markup chars.
Step 3: We push all other characters to the result string.
Step 4: We return the string. This contains all characters in the source string excluding markup regions.

```rust
fn strip_html(source: &str) -> String {
    let mut data = String::new();
    let mut inside = false;
    // Step 1: loop over string chars.
    for c in source.chars() {
        // Step 2: detect markup start and end, and skip over markup chars.
        if c == '<' {
            inside = true;
            continue;
        }
        if c == '>' {
            inside = false;
            continue;
        }
        if !inside {
            // Step 3: push other characters to the result string.
            data.push(c);
        }
    }
    // Step 4: return string.
    return data;
}

fn main() {
    // Use the strip html function to remove markup.
    let input = "<p>Hello <b>world</b>!</p>";
    let result = strip_html(input);
    println!("{input}");
    println!("{result}");
}
```

```
<p>Hello <b>world</b>!</p>
Hello world!
```
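It is easy to verify that the function correctly removes simple HTML tags. Tags inside HTML comments are a problem: the first closing angle bracket ends the markup region, so the remainder of the comment leaks into the result. A more complex function would be needed to support this.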
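As a rough sketch of what such a more complex function might look like, the version below treats everything from an opening comment marker to the closing one as a single markup region. The name strip_html_and_comments is hypothetical and not part of the original example, and the code assumes comments are not nested; it is an illustration, not a full HTML parser.

```rust
/// Hypothetical variant of strip_html that also drops HTML comments.
/// Assumption: comments are not nested and end at the first "-->".
fn strip_html_and_comments(source: &str) -> String {
    let mut data = String::new();
    let mut rest = source;
    // Each pass copies the text before the next '<', then skips the markup.
    while let Some(start) = rest.find('<') {
        data.push_str(&rest[..start]);
        let markup = &rest[start..];
        // A comment ends at "-->"; an ordinary tag ends at the first '>'.
        let end = if markup.starts_with("<!--") {
            markup[4..].find("-->").map(|i| 4 + i + 3)
        } else {
            markup.find('>').map(|i| i + 1)
        };
        match end {
            Some(len) => rest = &markup[len..],
            // Unterminated markup: drop the remainder of the input.
            None => return data,
        }
    }
    // No more markup: keep whatever text is left.
    data.push_str(rest);
    data
}
```

Calling strip_html_and_comments on a string such as "a<!-- <b>hidden</b> -->b" should leave only the text outside the comment. One design difference from the char loop: this sketch keeps a stray closing angle bracket that appears outside any tag, where the original function would skip it.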
It is possible to remove HTML tags from a string in Rust, and regular expressions are not needed. The code is small and easy to maintain.