Dot Net Perls

Scraping HTML Links in C#

by Sam Allen - Updated June 28, 2009

Problem. You want to extract every hyperlink, with its HREF value and anchor text, from specific web pages, for example to maintain or validate your own site. This is called screen scraping, and it has many legitimate uses for webmasters and ASP.NET developers. Solution. Here we look at how you can implement screen scraping for HTML links using the C# programming language.

--- HTML scraping tip: ---

    You can scrape HTML links.
    Regular expressions handle this well on typical markup.
    More than one Regex can be used.

1. Using System.Net

First, for my demonstration I will scrape HTML links from Wikipedia.org. Wikipedia publishes its text under free content licenses (the GNU Free Documentation License and Creative Commons Attribution-ShareAlike), and I consider this small demonstration fair use. The code below opens a connection to Wikipedia.org with WebClient and downloads the content at the specified URL as a string. Step 2 in the code passes that string to LinkFinder.Find, shown in the next section, and loops over each link and its text.

~~~ Program that scrapes HTML (C#) ~~~

using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // Scrape links from wikipedia.org

        // 1.
        // Download the page HTML as a string.
        // URL: http://en.wikipedia.org/wiki/Main_Page
        string s;
        using (WebClient w = new WebClient())
        {
            s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");
        }

        // 2.
        // Print each link; LinkItem.ToString formats the HREF and text.
        foreach (LinkItem i in LinkFinder.Find(s))
        {
            Debug.WriteLine(i);
        }
    }
}

2. Using regular expressions

Here I show a simple static class that receives the HTML string and extracts each link's HREF and text into a LinkItem struct. It is fairly fast, but I offer some optimization tips in the Performance section. In a larger program, LinkItem would be better written as a class with methods that act on its contents.

~~~ Class that finds links (C#) ~~~

using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        // \b keeps other tags that start with "a", such as <abbr>, from matching.
        MatchCollection m1 = Regex.Matches(file, @"(<a\b.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}

Description of the code. In step 1 the method finds all hyperlink tags. Regex.Matches returns a MatchCollection in which each Match holds one complete A element, from the opening tag through the closing tag, as a string.

Next steps. In step 2 it loops over those A element strings. In step 3 a second Regex reads the HREF attribute, which points to another web resource. This part is not failsafe, but works almost always: it expects a lowercase href with a double-quoted value, and Href stays null when no such attribute is found. In step 4 a third Regex removes every tag from the string, along with the whitespace around each tag, so that only the anchor text remains.

Final steps. Finally, the method returns the List of LinkItem structs it has built up. This list can then be used in the foreach loop from the first C# example. The ToString override prints the HREF on one line and the text, indented with a tab, on the next.
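
Steps 3 and 4 can be tried in isolation on a single A element string. The following is a minimal sketch, not part of the original program: the AnchorParts class and its two methods are hypothetical names, while the two patterns are the ones used in LinkFinder.Find. It also includes one case where the HREF pattern finds nothing, a single-quoted attribute.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; only the two patterns come from LinkFinder.Find.
public static class AnchorParts
{
    // Step 3: read a double-quoted href value; returns "" when there is no match.
    public static string GetHref(string tag)
    {
        Match m = Regex.Match(tag, @"href=\""(.*?)\""", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }

    // Step 4: remove every tag and the whitespace around it.
    public static string GetText(string tag)
    {
        return Regex.Replace(tag, @"\s*<.*?>\s*", "", RegexOptions.Singleline);
    }
}

class AnchorPartsDemo
{
    static void Main()
    {
        string tag = "<a href=\"/wiki/Portal:Arts\" title=\"Portal:Arts\"><b>Arts</b></a>";
        Console.WriteLine(AnchorParts.GetHref(tag)); // /wiki/Portal:Arts
        Console.WriteLine(AnchorParts.GetText(tag)); // Arts

        // Single-quoted attribute: the href pattern does not match.
        string single = "<a href='/wiki/Portal:Arts'>Arts</a>";
        Console.WriteLine(AnchorParts.GetHref(single).Length); // 0
    }
}
```

Because step 4 also removes the whitespace next to each tag, words separated only by an inner tag are joined: the text of `<a href="/x">free <b>content</b></a>` comes out as "freecontent". If that matters for your pages, replacing tags with a single space and trimming the result is one possible alternative.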

3. Testing the method

My first two attempts at this code were incorrect and had unacceptable bugs, but the version shown here seems to work very well. You need to use RegexOptions.Singleline. In .NET, the dot in a Regex matches every character except a newline unless this option is specified, so links that span multiple lines require it. [RegexOptions Enumeration - MSDN]

4. Evaluating the program

Run the program against a page on your website and it will write the matches to the debug output (the Output window in Visual Studio), because it uses Debug.WriteLine. Here we see part of the results for the Wikipedia home page at the time of writing. The original HTML shows where some of the links were extracted from: each is contained in an LI element. You will see that the program successfully extracted the anchor text and also the HREF value.

--- Output of the program ---

#column-one
    navigation
#searchInput
    search
/wiki/Wikipedia
    Wikipedia
/wiki/Free_content
    free
/wiki/Encyclopedia
    encyclopedia
/wiki/Wikipedia:Introduction
    anyone can edit
/wiki/Special:Statistics
    2,617,101
/wiki/English_language
    English
/wiki/Portal:Arts
    Arts
/wiki/Portal:Biography
    Biography
/wiki/Portal:Geography
    Geography
/wiki/Portal:History
    History
/wiki/Portal:Mathematics
    Mathematics
/wiki/Portal:Science
    Science
/wiki/Portal:Society
    Society
/wiki/Portal:Technology_and_applied_sciences
    Technology

--- Original website HTML ---

<ul>
<li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
<li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
<li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>

</ul>
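
Most HREF values in the output above are relative to the page they came from. If you plan to request each link, for example to validate it, one option is to resolve it against the page URL with System.Uri. This is a sketch under that assumption; the LinkResolver class and its Resolve method are hypothetical names, not part of the program above.

```csharp
using System;

// Hypothetical helper: turns a scraped href into an absolute URL.
public static class LinkResolver
{
    // Returns the absolute URL for href relative to pageUrl,
    // or null if the href is missing or cannot be resolved.
    public static string Resolve(string pageUrl, string href)
    {
        Uri result;
        if (href != null && Uri.TryCreate(new Uri(pageUrl), href, out result))
        {
            return result.AbsoluteUri;
        }
        return null;
    }
}

class LinkResolverDemo
{
    static void Main()
    {
        string page = "http://en.wikipedia.org/wiki/Main_Page";

        // Path relative to the site root.
        Console.WriteLine(LinkResolver.Resolve(page, "/wiki/Portal:Arts"));
        // http://en.wikipedia.org/wiki/Portal:Arts

        // Fragment within the same page.
        Console.WriteLine(LinkResolver.Resolve(page, "#searchInput"));
        // http://en.wikipedia.org/wiki/Main_Page#searchInput
    }
}
```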

5. Using Singleline mode

Many C# developers forget to specify that the dot in their Regexes should also match newlines, so their patterns cannot span multiple lines. MSDN states that Singleline "Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n)."
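
The difference is easy to see on a link whose text starts on a new line. The sketch below uses a simplified anchor pattern and a hypothetical SinglelineDemo class, both chosen only for illustration, and runs the same pattern with and without the option.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // A simplified anchor pattern, used here only to show how the dot behaves.
    const string Pattern = @"<a.*?>.*?</a>";

    // Default options: the dot stops at a newline.
    public static bool MatchesDefault(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }

    // Singleline: the dot also matches a newline.
    public static bool MatchesSingleline(string html)
    {
        return Regex.IsMatch(html, Pattern, RegexOptions.Singleline);
    }

    static void Main()
    {
        // The anchor text begins on the line after the opening tag.
        string html = "<a href=\"/wiki/Wikipedia\">\nWikipedia</a>";

        Console.WriteLine(MatchesDefault(html));    // False
        Console.WriteLine(MatchesSingleline(html)); // True
    }
}
```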

6. Performance

You can improve the performance of the regular expressions by specifying RegexOptions.Compiled and by using instance Regex objects stored in fields instead of the static methods I show. Normally, though, your Internet connection will be the bottleneck. [C# Regex Match Examples - dotnetperls.com]
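
As a rough sketch of that suggestion, the patterns can be built once as static readonly Regex instances and reused for every page. The CompiledLinkPatterns class and CountLinksWithHref method are hypothetical names; the anchor pattern includes \b so that only A tags match, which is an assumption of this sketch, and no timing figures are claimed here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledLinkPatterns
{
    // Built once and reused, instead of passing pattern strings on every call.
    static readonly Regex Anchor = new Regex(@"<a\b.*?>.*?</a>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    static readonly Regex Href = new Regex(@"href=\""(.*?)\""",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Counts the A elements that carry a double-quoted href attribute.
    public static int CountLinksWithHref(string html)
    {
        int count = 0;
        foreach (Match m in Anchor.Matches(html))
        {
            if (Href.IsMatch(m.Value))
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        string html = "<a href=\"/a\">A</a><a name=\"top\">B</a><abbr>C</abbr>";
        Console.WriteLine(CountLinksWithHref(html)); // 1
    }
}
```

RegexOptions.Compiled trades a slower first construction for faster matching afterwards, so it tends to pay off only when the same Regex is applied to many pages.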

7. Summary

Here we saw how you can scrape HTML links from the Internet using the C# programming language. Because the patterns are ordinary strings, the code is easy to adapt to other tags or attributes. Using three regular expressions, you can extract HTML links into objects with a fair degree of accuracy. I have tested this code on several sites where it is legal. It is a valuable tool for webmasters.

© 2009 Sam Allen. All rights reserved.