Recipe 14.16. Working with HTML

Problem

You need to parse some HTML to get certain values from it, but you don't want to write all of the HTML parsing logic yourself.

Solution

Use the Microsoft.mshtml Primary Interop Assembly (PIA) and let the Internet Explorer parsing engine do the work. First, add a reference to the Microsoft.mshtml.dll assembly, which is located in the Program Files\Microsoft.NET\Primary Interop Assemblies directory. Once the reference is in place, import the mshtml namespace like so:

 using mshtml;

With the reference and namespace in place, you can use MSHTML to do your HTML parsing. First create an instance of the HTMLDocument class, then declare a variable of the IHTMLDocument2 interface type and fill a string with some HTML to parse:

 HTMLDocument htmlDoc = new HTMLDocument();
 IHTMLDocument2 iHtmlDoc2 = null;
 string html =
     "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\"" +
     " \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" +
     "<html xmlns=\"http://www.w3.org/1999/xhtml\" >" +
     "<head><title>Test Page</title></head>" +
     "<body><form method=\"post\" action=\"Default.aspx\" id=\"form1\">" +
     "<div><input id=\"Text1\" type=\"text\" />" +
     "<input id=\"Checkbox1\" type=\"checkbox\" />" +
     "<input id=\"Radio1\" type=\"radio\" />" +
     "<select id=\"Select1\">" +
     "<option selected=\"selected\"></option></select>" +
     "<input name=\"TextBox1\" type=\"text\" id=\"TextBox1\" />" +
     "</div></form></body></html>";

The IHTMLDocument2 reference is obtained by casting the HTMLDocument instance to the IHTMLDocument2 interface. Design mode is then turned on so that any script embedded in the HTML does not execute while the HTML loads and is parsed. Write the HTML into iHtmlDoc2 using the write method, then call close to finish loading the HTML:

 // Get the IHTMLDocument2 interface.
 iHtmlDoc2 = (IHTMLDocument2)htmlDoc;
 // Put the document in design mode so we don't execute scripts
 // while loading.
 iHtmlDoc2.designMode = "On";
 // Have to put some HTML in the DOM before using it.
 iHtmlDoc2.write(html);
 // Close it.
 iHtmlDoc2.close();

Now that the HTML is loaded and parsed, look for items of interest in it. The all property on the document's body property exposes every element beneath the body; cast that property to IHTMLElementCollection and examine each IHTMLElement it contains.

Iterate over the IHTMLElement objects and test each one against the wrapper classes for specific HTML elements, such as anchors, forms, inputs, and text areas, as shown in Example 14-4.

Example 14-4. Parsing HTML

 // Roll over every element in the HTML.
 foreach (IHTMLElement htmlElem in (IHTMLElementCollection)iHtmlDoc2.body.all)
 {
     // Note: every time we do the is and as, it does a COM call to the
     // MSHTML object. This can be rather expensive so you would want to cache
     // the results elsewhere once you have them, not just keep calling
     // properties on it as those end up making a round-trip as well.
     if (htmlElem is HTMLAnchorElementClass)
     {
         HTMLAnchorElementClass anchor = htmlElem as HTMLAnchorElementClass;
         if (anchor != null)
             Console.WriteLine("Anchor element found: " + anchor.href);
     }
     else if (htmlElem is HTMLFormElementClass)
     {
         HTMLFormElementClass form = htmlElem as HTMLFormElementClass;
         if (form != null)
             Console.WriteLine("Form element found: " + form.id);
     }
     else if (htmlElem is HTMLGenericElementClass)
     {
         HTMLGenericElementClass genElem = htmlElem as HTMLGenericElementClass;
         if (genElem != null)
             // Label corrected: this branch handles generic elements, not inputs.
             Console.WriteLine("Generic Element found: " + genElem.scopeName +
                 "." + genElem.tagName);
     }
     else if (htmlElem is HTMLInputElementClass)
     {
         HTMLInputElementClass input = htmlElem as HTMLInputElementClass;
         if (input != null)
             Console.WriteLine("Input Element found: " + input.id);
     }
     else if (htmlElem is HTMLTextAreaElementClass)
     {
         HTMLTextAreaElementClass text = htmlElem as HTMLTextAreaElementClass;
         if (text != null)
             Console.WriteLine("Text Area Element found: " + text.name);
     }
 }

This code will have the following output:

 Form element found: form1
 Input Element found: Text1
 Input Element found: Checkbox1
 Input Element found: Radio1
 Input Element found: TextBox1
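
The comment in Example 14-4 recommends caching what you read from MSHTML rather than querying the COM objects repeatedly. One possible way to do that, sketched here under assumptions, is to copy the properties of interest into plain .NET objects in a single pass. The ElementInfo class and the HtmlSnapshot.Capture method are hypothetical names introduced for illustration, not part of MSHTML; the sketch reads only the tagName and id properties of IHTMLElement and needs the same Microsoft.mshtml reference as the rest of the recipe.

```csharp
using System.Collections.Generic;
using mshtml;

// Hypothetical holder for values copied out of MSHTML.
public class ElementInfo
{
    public string TagName;
    public string Id;
}

public static class HtmlSnapshot
{
    // Walks body.all once and copies tagName and id into managed objects,
    // so that later lookups do not have to cross the COM interop layer.
    public static List<ElementInfo> Capture(IHTMLDocument2 doc)
    {
        List<ElementInfo> result = new List<ElementInfo>();
        foreach (IHTMLElement elem in (IHTMLElementCollection)doc.body.all)
        {
            ElementInfo info = new ElementInfo();
            info.TagName = elem.tagName; // read once through interop
            info.Id = elem.id;           // may be null when no id is set
            result.Add(info);
        }
        return result;
    }
}
```

Calling HtmlSnapshot.Capture(iHtmlDoc2) after the document is closed would give you a list you can search and filter as often as you like without further round-trips to MSHTML.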

Discussion

There are many ways to parse HTML: regular expressions, straight text parsing, or third-party product offerings. The MSHTML parser is free and widely installed, but it is COM-based. In a .NET application, that means every call goes through the COM interop layer and has its arguments and results marshaled back and forth. MSHTML can be made to perform decently in many situations, but it is not suited to high-volume HTML parsing because of the overhead incurred each time the COM interop layer is crossed. Treat this as a client-side technique only; the overhead would quickly degrade performance in a server-side scenario such as an HttpHandler or a high-traffic web page.
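
To make the regular-expression alternative concrete, here is a minimal sketch that pulls the id attribute out of each input tag using System.Text.RegularExpressions. The InputIdFinder class, its FindInputIds method, and the pattern itself are illustrative assumptions rather than code from this recipe. The pattern is deliberately naive: it handles only double-quoted id attributes and would also match an attribute such as data-id. It avoids COM interop entirely, but it is far less tolerant of messy HTML than MSHTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InputIdFinder
{
    // Matches an <input ...> tag and captures the value of a
    // double-quoted id attribute inside it (assumed pattern, not exhaustive).
    private static readonly Regex InputId = new Regex(
        @"<input\b[^>]*?\bid\s*=\s*""(?<id>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns the id of every input tag found, in document order.
    public static List<string> FindInputIds(string html)
    {
        List<string> ids = new List<string>();
        foreach (Match m in InputId.Matches(html))
        {
            ids.Add(m.Groups["id"].Value);
        }
        return ids;
    }
}
```

Passing the html string from the Solution to InputIdFinder.FindInputIds should return the same four input ids that Example 14-4 prints (Text1, Checkbox1, Radio1, and TextBox1), without touching MSHTML.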

That said, if you need a quick way to parse HTML in your application and the parsing is not a potential performance hotspot, this method will do quite nicely. As with many things in .NET, once you know what a technique will be used for and how much work it will do, many alternatives become acceptable.