C# Parsing HTML with HtmlAgilityPack

Let’s say you have to parse some HTML, find all images or some other DOM elements, make some changes or optimizations, and save the result. What can you do?

Well, I needed something like that a few days ago, and after Googling for a couple of hours I ran across a great library called HtmlAgilityPack.

Sure, you can do it with the .NET WebBrowser control or MSHTML, but you will have to forget about performance and multithreading: it will eat all your memory and the CPU won’t know what hit it.

Now, back to HtmlAgilityPack.
Usage is very simple, and the performance is great (at least for my needs).

The following example finds all images without an “alt” attribute, adds one, and saves the HTML document back to disk.

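            using HtmlAgilityPack;

            // Load the document from disk.
            HtmlDocument document = new HtmlDocument();
            document.Load(@"e:\test.htm");

            // By default SelectNodes returns null (not an empty collection) when nothing matches.
            var imagesWithoutAlt = document.DocumentNode.SelectNodes("//img[not(@alt)]");
            if (imagesWithoutAlt != null)
            {
                foreach (HtmlNode image in imagesWithoutAlt)
                {
                    image.Attributes.Append("alt", "no alt image");
                }
            }

            // Overwrite the original file with the modified markup.
            document.Save(@"e:\test.htm");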
Categories: ASP.NET, C#, HTML, Programming
  1. July 15th, 2009 at 02:48 | #1

    Good tip. HTMLAgilityPack sounds interesting. I haven’t ever used the web browser control for parsing though. .NET has some great XML classes for the job.

    But HTMLAgilityPack looks as if it does a commendable job. Thanks for the tip, should have had a link there.

  2. July 15th, 2009 at 07:11 | #2

    @Cyril Gupta
    .NET indeed has some great classes for XML, but not for HTML.
    Just try opening a real-world HTML/XHTML file (not just an empty head and body) with XmlDocument or XDocument and you will see what happens.

  3. October 3rd, 2009 at 17:33 | #3

    I didn’t see a link to Html Agility Pack in this post so I figured I’d save everyone coming here a google. You’ll find it on codeplex at http://htmlagilitypack.codeplex.com

    Also, there is a new release that supports LINQ as an alternative to the XPath navigation, which does have some bugs.
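
    For illustration, the LINQ style mentioned here might look like the sketch below. It assumes an HtmlAgilityPack release that exposes Descendants() and SetAttributeValue(); the names AltTextFixer and AddMissingAlt and the placeholder parameter are made up for this example.

    ```csharp
    using System.Linq;
    using HtmlAgilityPack;

    public static class AltTextFixer
    {
        // Hypothetical helper: adds an alt attribute to every <img> that lacks one
        // and returns the resulting markup.
        public static string AddMissingAlt(string html, string placeholder)
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);

            // LINQ counterpart of the XPath query //img[not(@alt)].
            // Attributes["alt"] is null when the attribute is absent.
            var imagesWithoutAlt = document.DocumentNode
                .Descendants("img")
                .Where(img => img.Attributes["alt"] == null)
                .ToList();

            foreach (HtmlNode image in imagesWithoutAlt)
            {
                image.SetAttributeValue("alt", placeholder);
            }

            return document.DocumentNode.OuterHtml;
        }
    }
    ```

    Unlike SelectNodes, Descendants() yields an empty sequence rather than null when nothing matches, so no null check is needed here.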

  4. October 3rd, 2009 at 23:51 | #4

    @Jeff Klawiter, thanks for noticing. I must have missed it somehow.

  5. Ovi
    October 9th, 2009 at 10:24 | #5

    Please help me!
    I want to clean my HTML string so that it allows only the tags and attributes I define.
    I want to use HtmlAgilityPack.
    Thanks

  6. October 9th, 2009 at 11:52 | #6

    @Ovi You can do that by looping through all elements and removing those you don’t need, but I think in your case using a regular expression might be a better approach.
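
    The looping approach could be sketched roughly as follows. This is an illustration under assumptions, not a vetted sanitizer: HtmlWhitelist, Clean and the two set parameters are hypothetical names, and it assumes a release that provides HtmlNode.Remove() and HtmlAttribute.Remove().

    ```csharp
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    public static class HtmlWhitelist
    {
        // Hypothetical helper: keeps only the listed tags and attributes.
        // Tag and attribute names are expected in lower case.
        public static string Clean(string html, ISet<string> allowedTags, ISet<string> allowedAttributes)
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);

            // Snapshot the elements first so that removing nodes
            // does not disturb the enumeration.
            List<HtmlNode> elements = document.DocumentNode
                .Descendants()
                .Where(node => node.NodeType == HtmlNodeType.Element)
                .ToList();

            foreach (HtmlNode element in elements)
            {
                if (!allowedTags.Contains(element.Name))
                {
                    // Drops the element together with everything inside it.
                    element.Remove();
                    continue;
                }

                // Copy the attribute list before removing entries from it.
                foreach (HtmlAttribute attribute in element.Attributes.ToList())
                {
                    if (!allowedAttributes.Contains(attribute.Name))
                    {
                        attribute.Remove();
                    }
                }
            }

            return document.DocumentNode.OuterHtml;
        }
    }
    ```

    Removing a disallowed element here also discards its children; keeping the children while dropping only the tag would need a different removal call, and real input sanitization needs more care than this sketch takes.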

  7. March 5th, 2010 at 11:40 | #7

    How do I get all input elements in form2 of the HTML file below?

    I tried:
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"D:\test.html");

    foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
    {
        Console.WriteLine(node.Attributes["value"].Value);
    }

    But no luck.
    Did I do anything wrong?

  8. March 15th, 2010 at 18:36 | #8

    @Bill
    Hi, Bill

    Use: doc.DocumentNode.SelectNodes("//form[@id='form2']/input"), but don’t forget to check that the result isn’t null before executing the foreach.
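
    The null check could look like the minimal sketch below. FormInputs and GetInputValues are hypothetical names. The ElementsFlags line rests on an assumption: some HtmlAgilityPack releases treat <form> as an empty element, which leaves its inputs outside the form node, and removing that flag is a commonly used workaround. The query also uses //input rather than /input so that inputs nested deeper inside the form are included.

    ```csharp
    using System.Collections.Generic;
    using HtmlAgilityPack;

    public static class FormInputs
    {
        // Hypothetical helper: returns the value attribute of every <input>
        // inside the form with the given id.
        public static List<string> GetInputValues(string html, string formId)
        {
            // Assumption: needed only on releases that flag <form> as empty.
            // ElementsFlags is static, so this affects all later parsing.
            HtmlNode.ElementsFlags.Remove("form");

            var document = new HtmlDocument();
            document.LoadHtml(html);

            var values = new List<string>();

            // formId is inserted as-is; pass only trusted ids without quotes.
            HtmlNodeCollection inputs = document.DocumentNode
                .SelectNodes("//form[@id='" + formId + "']//input");

            // By default SelectNodes returns null when nothing matches.
            if (inputs == null)
            {
                return values;
            }

            foreach (HtmlNode input in inputs)
            {
                // Falls back to an empty string when there is no value attribute.
                values.Add(input.GetAttributeValue("value", string.Empty));
            }

            return values;
        }
    }
    ```

    GetAttributeValue avoids the NullReferenceException that node.Attributes["value"].Value throws for an input without a value attribute.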

  9. May 18th, 2010 at 12:14 | #9

    HTML Agility Pack is really a good option.
    But handling request timeouts is a challenge. I haven’t found a way to do it with HTML Agility Pack. Can you suggest any ideas?

  10. May 18th, 2010 at 19:05 | #10

    @Aditya, I don’t think you should load remote HTML with the Agility Pack. Use the HttpWebRequest class to fetch the URL content and then parse it with the Agility Pack.
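
    A rough sketch of that split, with the timeout set on the request itself. RemoteHtml, Load and timeoutMilliseconds are hypothetical names; newer .NET versions mark WebRequest.Create as obsolete in favor of HttpClient, so treat this as an illustration of the idea rather than the recommended API.

    ```csharp
    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    public static class RemoteHtml
    {
        // Hypothetical helper: downloads a page with an explicit timeout,
        // then hands the markup to HtmlAgilityPack for parsing.
        public static HtmlDocument Load(string url, int timeoutMilliseconds)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = timeoutMilliseconds;          // limit for getting the response
            request.ReadWriteTimeout = timeoutMilliseconds; // limit for reading the body

            using (var response = (HttpWebResponse)request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (var reader = new StreamReader(stream))
            {
                var document = new HtmlDocument();
                document.LoadHtml(reader.ReadToEnd());
                return document;
            }
        }
    }
    ```

    When the limit is exceeded, GetResponse throws a WebException whose Status is WebExceptionStatus.Timeout, so the caller can catch it and retry or give up.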

  11. May 19th, 2010 at 14:18 | #11

    Thanks for your suggestion.

  12. Kamal Deep Singh
    March 19th, 2011 at 08:34 | #12

    Awesome Work

  13. miri
    April 27th, 2011 at 13:27 | #13

    Why do I get a NullReferenceException?
    I’m still not convinced.

  14. July 13th, 2011 at 08:58 | #14

    Hi, how can I avoid special characters showing up when using HtmlAgilityPack?
    Say the actual inner text of a tag contains $12.34, but the InnerText that HtmlAgilityPack returns shows $12.34 (meaning $ in place of $). How can I avoid this?
    I want to get the exact text as it is shown in the browser.

  15. Raj
    October 10th, 2011 at 08:38 | #15

    Hi,
    I tried the following code but it did not work. It seems full XPath is not supported. Please check and let me know if I am doing anything wrong:

    static void Main(string[] args)
    {
        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
        //doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]");//("/html/body/div[2]/form/div/div[2]/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td[2]/div/input");
        //System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id);
        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("/html/body/div[2]/form/div"))
        {
            HtmlAttribute att = link.Attributes["id"];

            System.Console.Write(att.Value);
        }
        System.Console.ReadKey();
    }

  16. January 16th, 2013 at 07:30 | #16

    I’m trying to do something very simple with HtmlAgilityPack, I just don’t know anything about xml or nodes and I am having a lot of trouble pulling simple info from a website… Can someone please help me with this?
