C# Parsing HTML with HtmlAgilityPack

Let’s say you have to parse some HTML, find all images or some other DOM elements, make some changes or optimizations, and save the result. What can you do?

Well, I needed something like that a few days ago, and after Googling for a couple of hours I ran across a great library called HtmlAgilityPack.

Sure, you can do it with the C# WebBrowser control or MSHTML, but you will have to forget about performance and multithreading: it will eat all your memory, and the CPU won’t know what hit it.

Now, back to the HtmlAgilityPack.
The usage is very simple, and the performance is great (at least for my needs).

The following example finds all images without an “alt” attribute, adds one, and saves the HTML document back to disk.

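            // Requires the HtmlAgilityPack library (using HtmlAgilityPack;).
            HtmlDocument doc = new HtmlDocument();
            doc.Load(@"e:\test.htm");

            // SelectNodes returns null, not an empty collection, when nothing matches.
            var imagesWithoutAlt = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
            if (imagesWithoutAlt != null)
            {
                foreach (HtmlNode image in imagesWithoutAlt)
                {
                    image.Attributes.Append("alt", "no alt image");
                }
            }

            // Overwrite the original file with the updated markup.
            doc.Save(@"e:\test.htm");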

Categories: ASP.NET, C#, HTML, Programming
  1. July 15th, 2009 at 02:48 | #1

    Good tip. HTMLAgilityPack sounds interesting. I haven’t ever used the web browser control for parsing though. .Net has some great XML classes for the job.

    But HTMLAgilityPack looks as if it does a commendable job. Thanks for the tip, should have had a link there.

  2. July 15th, 2009 at 07:11 | #2

    @Cyril Gupta
    .NET indeed has some great classes for XML, but not for HTML.
    Just try to open a real HTML/XHTML file (not just a head and body) with XmlDocument/XDocument and you will see what happens: typically an XmlException at the first unclosed tag or undeclared entity such as &nbsp;.

  3. October 3rd, 2009 at 17:33 | #3

    I didn’t see a link to Html Agility Pack in this post so I figured I’d save everyone coming here a google. You’ll find it on codeplex at http://htmlagilitypack.codeplex.com

    Also, there is a new release that supports LINQ as an alternative to the XPath navigation, which does have some bugs.
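For readers wondering what the LINQ route looks like, here is a minimal sketch of the same "add a missing alt" task from the post, written with Descendants() instead of XPath. The class name AltFixer and the method AddMissingAlt are made up for illustration and are not part of the library; the sketch also works on an HTML string rather than a file so that it is easy to try.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, named only for this example.
public static class AltFixer
{
    public static string AddMissingAlt(string html, string altText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() yields an empty sequence when nothing matches,
        // so no null check is needed here (unlike SelectNodes).
        var images = doc.DocumentNode.Descendants("img")
            .Where(img => img.Attributes["alt"] == null)
            .ToList();

        foreach (HtmlNode img in images)
        {
            img.SetAttributeValue("alt", altText);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling AltFixer.AddMissingAlt(html, "no alt image") leaves images that already have an alt attribute untouched.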

  4. October 3rd, 2009 at 23:51 | #4

    @Jeff Klawiter, thanks for noticing. I must have missed it somehow.

  5. Ovi
    October 9th, 2009 at 10:24 | #5

    Please help me!
    I want to clean my HTML string so that it allows only the tags and attributes I define.
    I want to use HtmlAgilityPack.
    Thanks

  6. October 9th, 2009 at 11:52 | #6

    @Ovi You can do that by looping through all elements and removing those you don’t need, but I think in your case a regular expression might be a better approach.
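The looping approach could be sketched roughly as follows. The whitelist contents, the class name HtmlWhitelist and the method Clean are assumptions made up for this example, and this is only an illustration of the idea, not a hardened sanitizer: disallowed elements are dropped together with their children, and disallowed attributes are stripped from the elements that remain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the allowed tags and attributes are example values.
public static class HtmlWhitelist
{
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "p", "b", "i", "a", "br" };

    private static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "title" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first, so that removing nodes does not disturb
        // the enumeration of the tree.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (HtmlNode element in elements)
        {
            if (!AllowedTags.Contains(element.Name))
            {
                element.Remove(); // removes the element and everything inside it
                continue;
            }

            // Copy the attribute list before removing entries from it.
            foreach (HtmlAttribute attribute in element.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attribute.Name))
                {
                    attribute.Remove();
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Whether a disallowed element should be removed entirely, as here, or replaced by its inner content is a design decision that depends on what the cleaned HTML is used for.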

  7. March 5th, 2010 at 11:40 | #7

    How do I get all input elements in form2 of the HTML file below?

    I tried:
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"D:\test.html");

    foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
    {
        Console.WriteLine(node.Attributes["value"].Value);
    }

    But no luck.
    Did I do anything wrong?

  8. March 15th, 2010 at 18:36 | #8

    @Bill
    Hi, Bill

    Use doc.DocumentNode.SelectNodes("//form[@id='form2']/input"), but don’t forget to check that the result isn’t null before running the foreach.
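Here is a slightly fuller sketch of that lookup with the null check in place. Two things in it are assumptions rather than part of the answer above. First, it uses //input after the form so that inputs nested inside other elements are found too; /input matches direct children only. Second, as far as I know some HtmlAgilityPack versions treat form as a special element and do not nest its children under it, which could explain the "no luck" result; removing the entry from HtmlNode.ElementsFlags is a commonly cited workaround. The names FormInputs and GetInputValues are made up for the example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, named only for this example.
public static class FormInputs
{
    public static List<string> GetInputValues(string html, string formId)
    {
        // Assumption: needed only on versions that give <form> special
        // treatment. Note that this changes global parser state.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // The id is pasted into the XPath as-is, so it must not contain a single quote.
        HtmlNodeCollection inputs =
            doc.DocumentNode.SelectNodes("//form[@id='" + formId + "']//input");

        // SelectNodes returns null when nothing matches.
        if (inputs == null)
        {
            return values;
        }

        foreach (HtmlNode input in inputs)
        {
            // GetAttributeValue avoids a NullReferenceException when an
            // input has no value attribute.
            values.Add(input.GetAttributeValue("value", string.Empty));
        }

        return values;
    }
}
```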

  9. May 18th, 2010 at 12:14 | #9

    HTML Agility Pack is a really good option.
    But handling request timeouts is a challenge. I haven’t found a way to do it with HTML Agility Pack. Can you suggest any ideas?

  10. May 18th, 2010 at 19:05 | #10

    @Aditya, I don’t think you should load any remote HTML using the Agility Pack. Use the HttpWebRequest class to get the URL content and then parse it with the Agility Pack.
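A minimal sketch of that split, assuming a plain GET request: HttpWebRequest does the download and owns the timeouts, and the Agility Pack only parses the response stream. The names RemoteHtml and Fetch are made up for the example. On recent .NET versions WebRequest.Create is marked obsolete in favor of HttpClient, so treat this as an illustration of the idea rather than the recommended modern API.

```csharp
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper, named only for this example.
public static class RemoteHtml
{
    public static HtmlDocument Fetch(string url, int timeoutMilliseconds)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Timeout covers waiting for the response;
        // ReadWriteTimeout covers reading the response body.
        request.Timeout = timeoutMilliseconds;
        request.ReadWriteTimeout = timeoutMilliseconds;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            var doc = new HtmlDocument();
            doc.Load(stream); // parse while the response is still open
            return doc;
        }
    }
}
```

A timeout or connection failure surfaces as a WebException from GetResponse, so the caller can catch it and decide whether to retry.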

  11. May 19th, 2010 at 14:18 | #11

    Thanks for your suggestion.
