Using Html Agility Pack to Extract Content from a Web Page

Recently I was working on a hobby project in which I wanted to extract all the <img> tags from a web page. There are a lot of solutions available for parsing an HTML page using regex or various string manipulation methods, but I was looking for a neater solution and stumbled upon the Html Agility Pack. It's a .NET library that parses HTML files and produces a structure similar to the DOM.

It has been available for some time now; it is hosted at Codeplex and is available as a NuGet package. To install it in your solution, you can use either the NuGet package manager UI or the console. To install it from the console, run the following command:

Install-Package HtmlAgilityPack

Once the installation is complete, the package will be listed under the dependencies in project.json as well as in References.

First you need to import the following namespace.

using HtmlAgilityPack;

To load HTML into a document, we will use the Load method of the HtmlDocument class. Load has several overloads that accept inputs such as a file path, a Stream, or a TextReader. Note that the string overload treats its argument as a file path, not as markup.

var html = new HtmlDocument();

html.Load(@"C:\test\sample.html"); // loads from a file path

html.Load(new StringReader("<html><body></body></html>")); // loads from a TextReader (requires using System.IO;)

There is also another method, LoadHtml, which parses HTML from the specified string, and that is what I am going to use in our example.

var html = new HtmlDocument();

var result = GetHtmlFromWebUri("www.techrepository.in/visual-studio-tips-tricks-quick-actions"); // omitted function definition for brevity

html.LoadHtml(result);
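Since the helper above was left out, here is a minimal sketch of what it might look like. This is only an assumption about its shape, not the original implementation: the class name PageLoader, the NormalizeUrl helper, and the choice of HttpClient are all made up for illustration, and the "https://" prefix is added only because HttpClient needs an absolute URI.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PageLoader
{
    // A single shared HttpClient instance is reused for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: prefixes "https://" when the address has no scheme.
    public static string NormalizeUrl(string url)
    {
        bool hasScheme =
            url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
        return hasScheme ? url : "https://" + url;
    }

    // Hypothetical stand-in for the omitted GetHtmlFromWebUri function:
    // downloads the page and returns its markup as a string.
    public static async Task<string> GetHtmlFromWebUriAsync(string url)
    {
        return await Client.GetStringAsync(NormalizeUrl(url));
    }

    // Downloads the page and parses it with LoadHtml.
    public static async Task<HtmlDocument> LoadDocumentAsync(string url)
    {
        var html = new HtmlDocument();
        html.LoadHtml(await GetHtmlFromWebUriAsync(url));
        return html;
    }
}
```

From an async method, this could be called as var html = await PageLoader.LoadDocumentAsync("www.techrepository.in/visual-studio-tips-tricks-quick-actions");. The library also ships an HtmlWeb class whose Load method downloads and parses a URL in one step, which may be simpler if you don't need control over the HTTP request.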

Once loading is complete, you can access the root node through the DocumentNode property of the HtmlDocument instance and use the Descendants method to retrieve all the nodes beneath it.

html.LoadHtml(result); // loads the HTML
var root = html.DocumentNode; // retrieves the root node

var elements = root.Descendants(); // retrieves every node under the root, including text and comment nodes

var images = root.Descendants("img"); // retrieves only the <img> elements under the root node
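To make the difference between the two Descendants calls concrete, the sketch below counts the nodes each one returns for a given piece of markup. The class name DescendantsDemo and the method CountNodes are made up for this illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DescendantsDemo
{
    // Returns how many nodes of each kind sit under the document root.
    public static (int AllNodes, int Elements, int Images) CountNodes(string markup)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);
        var root = html.DocumentNode;

        // Every descendant node: elements, text nodes, and comments.
        int allNodes = root.Descendants().Count();

        // Only the element nodes among them.
        int elements = root.Descendants()
            .Count(n => n.NodeType == HtmlNodeType.Element);

        // Only the elements named "img".
        int images = root.Descendants("img").Count();

        return (allNodes, elements, images);
    }
}
```

For markup such as <html><body><div>Hello</div><img src="a.png"><img></body></html>, the total node count comes out higher than the element count because the text "Hello" is a node of its own, while the image count is 2.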

You can filter further, for example by keeping only the attributes that actually have a value, as shown below. The LINQ extension methods used here need using System.Linq; at the top of the file.

var images = root.Descendants("img") // retrieves all img tags
    .Select(e => e.GetAttributeValue("src", null)) // projects each element to its src value (null if the attribute is missing)
    .Where(s => !String.IsNullOrEmpty(s)); // keeps only the non-empty src values

So, to fetch the src value of every <img> tag that has one, you can use the following snippet.

html.LoadHtml(result); //html string content

var root = html.DocumentNode;

var images = root.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
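Putting the pieces together, here is a small self-contained console sketch that runs the same query against an inline HTML string instead of a downloaded page. The sample markup and the names ImageExtractor and GetImageSources are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ImageExtractor
{
    // Returns the non-empty src values of all <img> elements in the markup.
    public static List<string> GetImageSources(string markup)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);

        return html.DocumentNode
            .Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }

    public static void Main()
    {
        // Sample markup invented for this example: two usable images,
        // one without a src attribute, and one with an empty src.
        var markup = "<html><body>" +
                     "<img src=\"/images/logo.png\">" +
                     "<img alt=\"no source\">" +
                     "<img src=\"\">" +
                     "<img src=\"photo.jpg\">" +
                     "</body></html>";

        foreach (var src in GetImageSources(markup))
        {
            Console.WriteLine(src);
        }
    }
}
```

With this input, only /images/logo.png and photo.jpg pass the filter; the tag without a src attribute yields null and the empty one yields an empty string, so both are dropped by the Where clause.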

