C# Parsing HTML with HtmlAgilityPack
Let’s say you have to parse some HTML, find all images or some other DOM elements, make some changes or optimizations, and save the result. What can you do?
Well, I needed something like that a few days ago, and after Googling for a couple of hours I ran across a great library called HtmlAgilityPack.
Sure, you can do it with the C# WebBrowser control or MSHTML, but you will have to forget about performance and multithreading: it will eat all your memory and the CPU won’t know what hit it.
Now, back to the HtmlAgilityPack.
The usage is very simple, and the performance is great (at least for my needs).
The following example finds all images without an “alt” attribute, adds one, and saves the HTML document again.
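// Requires: using HtmlAgilityPack;
HtmlDocument HD = new HtmlDocument();
HD.Load(@"e:\test.htm");
// XPath: every <img> element that has no alt attribute.
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
        HN.Attributes.Append("alt", "no alt image");
    }
}
// Overwrite the original file with the modified markup.
HD.Save(@"e:\test.htm");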
Categories: ASP.NET, C#, HTML, Programming


Good tip. HtmlAgilityPack sounds interesting. I haven’t ever used the WebBrowser control for parsing, though. .NET has some great XML classes for the job.
But HtmlAgilityPack looks as if it does a commendable job. Thanks for the tip, though the post should have had a link.
@Cyril Gupta
.NET indeed has some great classes for XML, but not for HTML.
Just try to open some HTML/XHTML file (but a real one, not just head and body) with XmlDocument/XDocument and you will see what will happen.
I didn’t see a link to Html Agility Pack in this post so I figured I’d save everyone coming here a google. You’ll find it on codeplex at http://htmlagilitypack.codeplex.com
Also, there is a new release that supports LINQ as an alternative to the XPath navigation, which does have some bugs.
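As a rough sketch of what the LINQ style mentioned above could look like, the snippet below redoes the "images without alt" example from the post using `Descendants` and `Where` instead of XPath. The inline HTML string and the class name `LinqExample` are made up for illustration; only the HtmlAgilityPack calls themselves are taken from the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Illustrative markup only; in the post the document is loaded from a file.
        doc.LoadHtml("<html><body><img src='a.png'><img src='b.png' alt='b'></body></html>");

        // Descendants("img") enumerates every <img> element below the root.
        // The attribute indexer returns null when the attribute is missing.
        var imagesWithoutAlt = doc.DocumentNode
            .Descendants("img")
            .Where(img => img.Attributes["alt"] == null)
            .ToList(); // snapshot before modifying the nodes

        foreach (HtmlNode img in imagesWithoutAlt)
        {
            img.SetAttributeValue("alt", "no alt image");
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Unlike `SelectNodes`, the LINQ query yields an empty sequence rather than null when nothing matches, so no null check is needed here.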
@Jeff Klawiter , thanks for noticing. I must have missed it somehow.
Please help me!
I want to clean my HTML string so that it allows only the tags and attributes I define.
I want to use HtmlAgilityPack.
Thanks
@Ovi You can do that by looping through all elements and removing those you don’t need, but I think in your case a regular expression might be a better approach.
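For the looping approach, a minimal sketch might look like the following. The whitelist contents, the `Clean` helper, and the class name are all hypothetical, and this is not meant as a security-grade sanitizer (for example, it does not inspect `href` values); it only illustrates removing elements and attributes that are not on a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class WhitelistExample
{
    // Hypothetical whitelist: allowed tag name -> allowed attribute names.
    static readonly Dictionary<string, HashSet<string>> Allowed =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "p", new HashSet<string>(StringComparer.OrdinalIgnoreCase) },
            { "b", new HashSet<string>(StringComparer.OrdinalIgnoreCase) },
            { "a", new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href" } },
        };

    static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first, because the tree is modified while looping.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (HtmlNode node in elements)
        {
            HashSet<string> allowedAttributes;
            if (!Allowed.TryGetValue(node.Name, out allowedAttributes))
            {
                node.Remove(); // drops the element together with its children
                continue;
            }

            // Copy the attribute list before removing entries from it.
            foreach (HtmlAttribute attribute in node.Attributes.ToList())
            {
                if (!allowedAttributes.Contains(attribute.Name))
                {
                    attribute.Remove();
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        Console.WriteLine(Clean("<p onclick='x()'>Hi <script>alert(1)</script><a href='/home' style='color:red'>home</a></p>"));
    }
}
```

Removing a disallowed element here also discards its text; whether to keep the inner text instead is a design choice that depends on the use case.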
How can I get all input elements in form2 of the HTML file below?
I tried:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
{
    Console.WriteLine(node.Attributes["value"].Value);
}
But no luck.
Did I do anything wrong?
@Bill
Hi, Bill
Use: doc.DocumentNode.SelectNodes("//form[@id='form2']/input"), but don’t forget to check that the result isn’t null before executing foreach.
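A small sketch of that null check, reusing the file path and XPath from the comments above. The class name and the printed message are illustrative; `GetAttributeValue` is used so that an input without a `value` attribute does not throw.

```csharp
using System;
using HtmlAgilityPack;

class NullCheckExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load(@"D:\test.html");

        HtmlNodeCollection inputs =
            doc.DocumentNode.SelectNodes("//form[@id='form2']/input");

        // SelectNodes returns null when the XPath matches nothing,
        // so iterating the result directly would throw.
        if (inputs == null)
        {
            Console.WriteLine("No matching input elements found.");
            return;
        }

        foreach (HtmlNode input in inputs)
        {
            // Falls back to an empty string when the attribute is missing.
            Console.WriteLine(input.GetAttributeValue("value", ""));
        }
    }
}
```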
HTML Agility Pack is really a good option.
But handling request timeouts is a challenge. I haven’t found a way to do it with HTML Agility Pack. Can you suggest any ideas?
@Aditya, I don’t think you should load any remote HTML using the Agility Pack. Use the HttpWebRequest class to get the URL content and then parse it with the Agility Pack.
Thanks for your suggestion.
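One way this could be sketched: download the page with HttpWebRequest, set its timeouts, and hand the resulting string to the Agility Pack. The helper name `LoadWithTimeout`, the 5000 ms value, and the example.com URL are placeholders, and the sketch assumes the response can be read with StreamReader's default UTF-8 decoding.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class TimeoutExample
{
    // Hypothetical helper: fetch a URL with a timeout, then parse the markup.
    static HtmlDocument LoadWithTimeout(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = timeoutMs;          // limit for getting the response
        request.ReadWriteTimeout = timeoutMs; // limit for reading the body stream

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());
            return doc;
        }
    }

    static void Main()
    {
        try
        {
            HtmlDocument doc = LoadWithTimeout("http://example.com/", 5000);
            Console.WriteLine(doc.DocumentNode.OuterHtml.Length);
        }
        catch (WebException ex)
        {
            // A timeout is reported as WebExceptionStatus.Timeout.
            Console.WriteLine("Request failed: " + ex.Status);
        }
    }
}
```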
Awesome Work
Why do I get a NullReferenceException?
I’m still not convinced.
Hi, how can I avoid special characters showing up when using Html Agility Pack?
Say the actual inner text of a tag contains $12.34, but the InnerText returned by Html Agility Pack shows $12.34 (that is, $ in place of $). How can I avoid this?
I want to get the exact text as it is shown in the browser.
Hi,
I tried the following code but it did not work. It seems that full XPath is not supported. Please check and let me know if I am doing anything wrong:
static void Main(string[] args)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
    //doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]");//("/html/body/div[2]/form/div/div[2]/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td[2]/div/input");
    //System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id);
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("/html/body/div[2]/form/div"))
    {
        HtmlAttribute att = link.Attributes["id"];
        System.Console.Write(att.Value);
    }
    System.Console.ReadKey();
}