C# Parsing HTML with HtmlAgilityPack
Let's say you have to parse some HTML, find all images or other DOM elements, make some changes or optimizations, and save the result. What can you do?
Well, I needed something like that a few days ago, and after Googling for a couple of hours I ran across a great library called HtmlAgilityPack.
Sure, you can do it with the C# WebBrowser control or MSHTML, but you will have to forget about performance or multithreading: it will eat all your memory and the CPU won't know what hit it.
Now, back to the HtmlAgilityPack.
The usage is very simple, and the performance is great (at least for my needs).
The following example shows how to find all images without an "alt" attribute, add one, and save the HTML document again.
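using HtmlAgilityPack;

// Load the document from disk.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"e:\test.htm");
// Select every <img> without an alt attribute. SelectNodes returns null when nothing matches.
var imagesWithoutAlt = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
if (imagesWithoutAlt != null)
{
    foreach (HtmlNode image in imagesWithoutAlt)
    {
        image.Attributes.Append("alt", "no alt image");
    }
}
// Write the modified document back to the same file.
doc.Save(@"e:\test.htm");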
Good tip. HTMLAgilityPack sounds interesting. I haven't ever used the web browser control for parsing, though. .NET has some great XML classes for the job.
But HTMLAgilityPack looks as if it does a commendable job. Thanks for the tip, should have had a link there.
@Cyril Gupta
.NET indeed has some great classes for XML, but not for HTML.
Just try to open an HTML/XHTML file (a real one, not just head and body) with XmlDocument/XDocument and you will see what happens.
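To make that point concrete, here is a minimal sketch using only the standard System.Xml classes. The class name, the helper `TryLoadAsXml`, and the sample markup are made up for illustration. Markup that browsers accept without complaint, such as an unclosed `<br>`, is not well-formed XML, so `XmlDocument` rejects it with an `XmlException`.

```csharp
using System;
using System.Xml;

public static class XmlVersusHtml
{
    // Returns true only if the markup is well-formed XML.
    public static bool TryLoadAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical causes in real pages: unclosed <br> or <img>, unquoted attributes.
            return false;
        }
    }

    public static void Main()
    {
        // Hypothetical sample: ordinary HTML with unclosed <br> and <img> tags.
        string html = "<html><body><p>First<br>Second<img src=\"a.png\"></p></body></html>";
        Console.WriteLine(TryLoadAsXml(html));
    }
}
```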
I didn't see a link to Html Agility Pack in this post, so I figured I'd save everyone coming here a Google search. You'll find it on CodePlex at http://htmlagilitypack.codeplex.com
Also, there is a new release that supports LINQ as an alternative to the XPath navigation, which does have some bugs.
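For comparison with the XPath version in the post, the same "images without alt" query might look like this with the LINQ-style API. This is a sketch under assumptions: it relies on a release that exposes `Descendants(string)` and `SetAttributeValue`, and the class name and sample markup are invented for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinqAltExample
{
    // Adds a placeholder alt attribute to every <img> that lacks one
    // and returns the resulting markup.
    public static string AddMissingAlt(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() yields an empty sequence rather than null when nothing matches,
        // so no null check is needed here. ToList() takes a snapshot before modifying.
        var imagesWithoutAlt = doc.DocumentNode
            .Descendants("img")
            .Where(img => img.Attributes["alt"] == null)
            .ToList();

        foreach (HtmlNode image in imagesWithoutAlt)
        {
            image.SetAttributeValue("alt", "no alt image");
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Hypothetical input: one image without alt, one with.
        Console.WriteLine(AddMissingAlt("<p><img src=\"a.png\"><img src=\"b.png\" alt=\"B\"></p>"));
    }
}
```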
@Jeff Klawiter, thanks for noticing. I must have missed it somehow.
Please help me!
I want to clean my HTML string so that it allows only the tags and attributes I define.
I want to use HtmlAgilityPack.
Thanks
@Ovi You can do that by looping through all elements and removing those you don't need, but I think in your case a regular expression might be a better approach.
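The looping approach could be sketched roughly as follows. The allowed tag and attribute lists, the class name, and the `Clean` helper are all assumptions made for the example. Note that this version drops a disallowed element together with its contents, and it is not meant as a security-hardened sanitizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelist
{
    // Hypothetical whitelists; adjust to the tags and attributes you want to keep.
    static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "p", "b", "i", "a" };
    static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first, because the tree is modified while we walk it.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (HtmlNode element in elements)
        {
            if (!AllowedTags.Contains(element.Name))
            {
                // Removes the element and everything inside it.
                element.Remove();
                continue;
            }
            // Strip every attribute that is not on the whitelist.
            foreach (HtmlAttribute attribute in element.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attribute.Name))
                {
                    attribute.Remove();
                }
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<p onclick=\"x()\">Hi <script>alert(1)</script><b>there</b></p>"));
    }
}
```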
How do I get all input elements in form2 of the HTML file below?
I tried:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
{
    Console.WriteLine(node.Attributes["value"].Value);
}
But no luck.
Did I do anything wrong?
@Bill
Hi, Bill
Use: doc.DocumentNode.SelectNodes("//form[@id='form2']/input"), but don't forget to check that the result isn't null before running the foreach.
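Put together, that suggestion might look like the sketch below. The file path and form id come from the question; the class name and the message text are invented. One assumption worth verifying against your version of the library: some releases reportedly treat `form` as an empty element by default, so its inputs end up as siblings rather than children. Removing the entry with `HtmlNode.ElementsFlags.Remove("form")` before loading is a commonly cited workaround.

```csharp
using System;
using HtmlAgilityPack;

public static class FormInputs
{
    public static void Main()
    {
        // Possible workaround if <form> is parsed without children (version-dependent assumption):
        // HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.Load(@"D:\test.html");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//form[@id='form2']/input");
        if (inputs == null)
        {
            Console.WriteLine("No input elements found in form2.");
            return;
        }

        foreach (HtmlNode input in inputs)
        {
            // GetAttributeValue avoids a NullReferenceException when "value" is missing.
            Console.WriteLine(input.GetAttributeValue("value", string.Empty));
        }
    }
}
```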
HTML Agility Pack is really a good option.
But handling request timeouts is a challenge. I haven't found a way to do it with HTML Agility Pack. Can you suggest any ideas?
@Aditya, I don't think you should load any remote HTML using the Agility Pack. Use the HttpWebRequest class to get the URL content and then parse it with the Agility Pack.
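A minimal sketch of that split, with the download and its timeout handled by HttpWebRequest and only the parsing left to the Agility Pack. The URL, the five-second timeout, and the class and method names are placeholders, and the response is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

public static class RemoteHtml
{
    // Downloads the page with an explicit timeout, then parses the returned markup.
    public static HtmlDocument Download(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = timeoutMs;          // time allowed to get the response
        request.ReadWriteTimeout = timeoutMs; // time allowed for reads on the response stream

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream)) // assumes UTF-8 content
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());
            return doc;
        }
    }

    public static void Main()
    {
        try
        {
            // Placeholder URL and timeout.
            HtmlDocument doc = Download("http://example.com/", 5000);
            Console.WriteLine(doc.DocumentNode.OuterHtml.Length);
        }
        catch (WebException ex)
        {
            if (ex.Status == WebExceptionStatus.Timeout)
            {
                Console.WriteLine("The request timed out.");
            }
            else
            {
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```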
Thanks for your suggestion.