Recipe 17.2. Accessing Content Within an HTML Document

Problem

You need to extract some information from within a web page.

Solution

Sample code folder: Chapter 17\UseHTMLDOM

While you could use standard string-manipulation techniques to scan through a web page, it's a lot of work. If the HTML content you need to parse has a consistent format with identifiable tags and elements, you can use Microsoft's Managed HTML Document Object Model (DOM) to traverse the HTML content as a set of objects.

Discussion

This recipe builds on the code developed in Recipe 17.1. Create a new Windows Forms project following the instructions in that recipe. Now add the following additional code to the form's code template:

   Private Sub WebContent_DocumentCompleted( _
         ByVal sender As Object, ByVal e As _
         System.Windows.Forms. _
         WebBrowserDocumentCompletedEventArgs) _
         Handles WebContent.DocumentCompleted
      ' ----- Extract the title and display it.
      MsgBox(WebContent.Document.Title)
   End Sub

Run the program, and as you browse from page to page, the title of each page will appear in a message box.

The Managed HTML DOM, made available through the WebBrowser control's Document property, provides object-based access to all elements of an HTML page, including links (via the Links property), cookies associated with the page (via the Cookie string property), and the body content (via the Body property). You can search for specific elements by ID using the GetElementById() method.
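As a rough illustration of these members, the following sketch lists the first few link targets on a page, reports the length of the body text, and looks up a single element by ID. It is only a sketch: the SummarizePage routine name, the limit of ten links, and the "content" element ID are assumptions made for this example and do not come from the recipe's sample code. The page you browse to may not contain an element with that ID at all.

```vb
' ----- Hypothetical helper, not part of the recipe's sample code.
'       Call it from the DocumentCompleted handler, for example:
'          SummarizePage(WebContent.Document)
Private Sub SummarizePage(ByVal pageDocument As HtmlDocument)
   Const MaxLinks As Integer = 10     ' Assumed display limit.
   Dim summary As New System.Text.StringBuilder()
   Dim shown As Integer = 0

   If (pageDocument Is Nothing) Then Return

   ' ----- Show the target of each link, up to the limit.
   summary.AppendLine("Links: " & pageDocument.Links.Count)
   For Each oneLink As HtmlElement In pageDocument.Links
      If (shown >= MaxLinks) Then Exit For
      summary.AppendLine("   " & oneLink.GetAttribute("href"))
      shown += 1
   Next oneLink

   ' ----- Guard against a missing body or empty text.
   If (pageDocument.Body IsNot Nothing) Then
      Dim bodyText As String = pageDocument.Body.InnerText
      If (bodyText IsNot Nothing) Then
         summary.AppendLine("Body text length: " & bodyText.Length)
      End If
   End If

   ' ----- "content" is an assumed ID. GetElementById returns
   '       Nothing when no element on the page has that ID.
   Dim section As HtmlElement = _
      pageDocument.GetElementById("content")
   If (section IsNot Nothing) Then
      summary.AppendLine("Found a " & section.TagName & _
         " element with the ID ""content"".")
   End If

   MsgBox(summary.ToString())
End Sub
```

Keeping the logic in a separate routine that accepts an HtmlDocument means it can sit alongside the title-display handler without clashing with it, and each lookup is checked for Nothing before use because not every page supplies a body or the requested element.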

Specific use of the Managed HTML DOM is beyond the scope of this book. Use the MSDN documentation supplied with Visual Studio to obtain information about the HtmlElement class and other classes used within the DOM.
