We can extract important page elements by scraping HTML. With the Regex type and WebClient (in the C# language) we implement screen scraping for HTML.
Notes: We cannot easily parse HTML with regular expressions, but we can extract links and other parts of strings with them fairly well.
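As a small illustration of the kind of extraction that works well, here is a minimal sketch that pulls href values out of a string. The HrefSketch class and FindHrefs method are hypothetical names used only for this example. The pattern assumes the attribute value is quoted, and it accepts either double or single quotes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefSketch
{
    // Hypothetical helper for illustration only.
    // Group 1 captures a double-quoted value, group 2 a single-quoted value.
    const string HrefPattern = @"href\s*=\s*(?:""([^""]*)""|'([^']*)')";

    public static List<string> FindHrefs(string html)
    {
        List<string> result = new List<string>();
        foreach (Match m in Regex.Matches(html, HrefPattern, RegexOptions.IgnoreCase))
        {
            // Only one of the two groups takes part in any given match.
            string value = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            result.Add(value);
        }
        return result;
    }
}
```

For the input `<a href="/a">A</a> <a href='/b'>B</a>`, FindHrefs returns the two values /a and /b. Unquoted attribute values are not handled, which is one reason a regular expression is not a full HTML parser.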
We will scrape HTML links from Wikipedia. Wikipedia's text is freely licensed (under the Creative Commons Attribution-ShareAlike license and the GFDL, not the GPL), and this demonstration is fair use.
using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // 1.
        // URL: http://en.wikipedia.org/wiki/Main_Page
        WebClient w = new WebClient();
        string s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");

        // 2.
        foreach (LinkItem i in LinkFinder.Find(s))
        {
            Debug.WriteLine(i);
        }
    }
}
This class receives the HTML string and then extracts all the links and their text into structs. It is fairly fast, but I offer some optimization tips further down.
The Find method stores all the complete anchor tags it matches in a MatchCollection. It returns a List of LinkItem objects. This list can then be used in the foreach-loop from the first C# example. Each regular expression specifies RegexOptions.Singleline. This is an important option: it lets the . metacharacter match newlines, so tags that span several lines are still found.

using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
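To see why RegexOptions.Singleline matters here, the following sketch counts anchor matches with and without the option. It reuses the anchor pattern from LinkFinder.Find. The SinglelineSketch class and CountAnchors method are hypothetical names introduced only for this comparison.

```csharp
using System.Text.RegularExpressions;

static class SinglelineSketch
{
    // Same anchor pattern as in LinkFinder.Find.
    const string AnchorPattern = @"(<a.*?>.*?</a>)";

    // Hypothetical helper: counts anchor matches under the given options.
    public static int CountAnchors(string html, RegexOptions options)
    {
        return Regex.Matches(html, AnchorPattern, options).Count;
    }
}
```

Given an anchor whose attributes are split across two lines, such as "<a href=\"/wiki/Portal:Arts\"\n title=\"Portal:Arts\">Arts</a>", CountAnchors returns 0 with RegexOptions.None and 1 with RegexOptions.Singleline. Without the option, . stops at the newline and the opening tag never reaches its closing bracket.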
Program output: Test the program on your website. It writes each match with Debug.WriteLine, so the results appear in the debugger's output window. Here we see part of the current results for the Wikipedia home page.
#column-one
    navigation
#searchInput
    search
/wiki/Wikipedia
    Wikipedia
/wiki/Free_content
    free
/wiki/Encyclopedia
    encyclopedia
/wiki/Wikipedia:Introduction
    anyone can edit
/wiki/Special:Statistics
    2,617,101
/wiki/English_language
    English
/wiki/Portal:Arts
    Arts
/wiki/Portal:Biography
    Biography
/wiki/Portal:Geography
    Geography
/wiki/Portal:History
    History
/wiki/Portal:Mathematics
    Mathematics
/wiki/Portal:Science
    Science
/wiki/Portal:Society
    Society
/wiki/Portal:Technology_and_applied_sciences
    Technology

Source HTML for some of these links:

<ul>
<li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
<li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
<li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>
</ul>
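Most of the extracted hrefs are relative to the page they came from. If absolute URLs are needed, one possible approach is to combine each href with the page URL through the Uri class. This is only a sketch: the LinkResolver class and ToAbsolute method are hypothetical names, not part of the program above.

```csharp
using System;

static class LinkResolver
{
    // Hypothetical helper: combines the page URL with an href from LinkItem.Href.
    // Returns null when the two cannot be combined into a valid URI.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri absolute;
        if (Uri.TryCreate(baseUri, href, out absolute))
        {
            return absolute.AbsoluteUri;
        }
        return null;
    }
}
```

For example, ToAbsolute("http://en.wikipedia.org/wiki/Main_Page", "/wiki/Portal:Arts") returns http://en.wikipedia.org/wiki/Portal:Arts.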
You can improve performance of the regular expressions by specifying RegexOptions.Compiled. Also, you can use instance Regex objects, not the static methods.
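A minimal sketch of both tips together might look like the following. It keeps the three patterns from LinkFinder but stores them as Regex instances created once with RegexOptions.Compiled. The CompiledLinkPatterns class and its FindPairs method are hypothetical names for this example, and whether this is actually faster depends on how many pages you process, so measure before relying on it.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CompiledLinkPatterns
{
    const RegexOptions Options = RegexOptions.Singleline | RegexOptions.Compiled;

    // Each Regex instance is constructed once and reused on every call.
    static readonly Regex Anchor = new Regex(@"(<a.*?>.*?</a>)", Options);
    static readonly Regex Href = new Regex(@"href=\""(.*?)\""", Options);
    static readonly Regex Tags = new Regex(@"\s*<.*?>\s*", Options);

    // Hypothetical helper: returns (href, text) pairs for each anchor found.
    public static List<KeyValuePair<string, string>> FindPairs(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            string value = m.Groups[1].Value;

            // Href stays null when the anchor has no double-quoted href.
            Match h = Href.Match(value);
            string href = h.Success ? h.Groups[1].Value : null;

            // Strip the tags to leave only the link text.
            string text = Tags.Replace(value, "");
            pairs.Add(new KeyValuePair<string, string>(href, text));
        }
        return pairs;
    }
}
```

Compiled expressions cost more to construct, so the benefit comes from reusing the same instances across many inputs.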
We scraped HTML content from the Internet. Using 3 regular expressions, you can extract HTML links into objects with a fair degree of accuracy.