Recipe 14.16. Working with HTML

Problem

You need to parse some HTML to get certain values from it, but you don't want to write all of the HTML parsing logic yourself.

Solution

Use the Microsoft.MSHTML Primary Interop Assembly wrapper and let the Internet Explorer parsing engine do the work. First, add a reference to the MSHTML control, which is located in the Program Files\Microsoft.NET\Primary Interop Assemblies directory in the Microsoft.mshtml.dll assembly. Once the reference is in place, import the mshtml namespace:

    using mshtml;

With the code set up, you can use the MSHTML control to do your HTML parsing. First declare an instance of the HTMLDocument class, then declare an IHTMLDocument2 interface reference and fill a string with some HTML to parse:

    HTMLDocument htmlDoc = new HTMLDocument();
    IHTMLDocument2 iHtmlDoc2 = null;
    string html =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" " +
        "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" +
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" >" +
        "<head><title>Test Page</title></head>" +
        "<body><form method=\"post\" action=\"Default.aspx\" id=\"form1\">" +
        "<div><input id=\"Text1\" type=\"text\" />" +
        "<input id=\"Checkbox1\" type=\"checkbox\" />" +
        "<input id=\"Radio1\" type=\"radio\" />" +
        "<select id=\"Select1\">" +
        "<option selected=\"selected\"></option></select>" +
        "<input name=\"TextBox1\" type=\"text\" id=\"TextBox1\" />" +
        "</div></form></body></html>";

The IHTMLDocument2 interface reference is set by casting the HTMLDocument to the IHTMLDocument2 interface. Design mode is then turned on so that any script embedded in the HTML does not execute while the HTML loads and is parsed. Place the HTML into iHtmlDoc2 using the write method, and call close to finish loading the HTML:

    // Get the IHTMLDocument2 interface.
    iHtmlDoc2 = (IHTMLDocument2)htmlDoc;

    // Put the document in design mode so we don't execute scripts
    // while loading.
    iHtmlDoc2.designMode = "On";

    // Have to put some HTML in the DOM before using it.
    iHtmlDoc2.write(html);

    // Close it.
    iHtmlDoc2.close();

Now that the HTML is loaded and parsed, look for items of interest in it. Do this by casting the iHtmlDoc2 interface to the HTMLDocumentClass, then look at each IHTMLElement in the IHTMLElementCollection exposed by the all property on the body property for the HTMLDocumentClass. Roll over each of the IHTMLElements and check against the various type classes for various HTML elements like form, input, text areas, and more as shown in Example 14-4.

Example 14-4. Parsing HTML
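As a rough illustration of the loop just described, here is a minimal sketch, not the original listing. It assumes a project reference to Microsoft.mshtml.dll with interop types not embedded (so that HTMLDocumentClass is usable). The class name ParseHtmlSketch is hypothetical, the test HTML is shortened by dropping the DOCTYPE and xmlns attribute, and the element checks use the interfaces IHTMLFormElement, IHTMLInputElement, and IHTMLTextAreaElement rather than the element classes the text mentions. That substitution is an assumption made here because interface checks rely only on COM QueryInterface.

```csharp
using System;
using mshtml;

// Hypothetical class name; requires a reference to Microsoft.mshtml.dll.
class ParseHtmlSketch
{
    [STAThread] // MSHTML is an apartment-threaded COM component.
    static void Main()
    {
        // Shortened version of the recipe's test page.
        string html =
            "<html><head><title>Test Page</title></head>" +
            "<body><form method=\"post\" action=\"Default.aspx\" id=\"form1\">" +
            "<div><input id=\"Text1\" type=\"text\" />" +
            "<input id=\"Checkbox1\" type=\"checkbox\" />" +
            "<input id=\"Radio1\" type=\"radio\" />" +
            "<select id=\"Select1\">" +
            "<option selected=\"selected\"></option></select>" +
            "<input name=\"TextBox1\" type=\"text\" id=\"TextBox1\" />" +
            "</div></form></body></html>";

        // Load the HTML as in the Solution: design mode on, write, close.
        HTMLDocument htmlDoc = new HTMLDocument();
        IHTMLDocument2 iHtmlDoc2 = (IHTMLDocument2)htmlDoc;
        iHtmlDoc2.designMode = "On";
        iHtmlDoc2.write(html);
        iHtmlDoc2.close();

        // body.all is typed as object in the interop assembly, so cast it.
        HTMLDocumentClass htmlDocClass = (HTMLDocumentClass)iHtmlDoc2;
        IHTMLElementCollection elements =
            (IHTMLElementCollection)htmlDocClass.body.all;

        foreach (IHTMLElement element in elements)
        {
            // Read the id once; every property access crosses COM interop.
            string id = element.id;

            if (element is IHTMLFormElement)
            {
                Console.WriteLine("Form element found: " + id);
            }
            else if (element is IHTMLInputElement)
            {
                Console.WriteLine("Input Element found: " + id);
            }
            else if (element is IHTMLTextAreaElement)
            {
                Console.WriteLine("Text Area Element found: " + id);
            }
            // Other elements (div, select, option) are skipped in this sketch.
        }
    }
}
```

Each is test and each property read on an element is a round trip through the COM interop layer, which is why the sketch copies element.id into a local string once instead of reading it in every branch.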
This code will have the following output:

    Form element found: form1
    Input Element found: Text1
    Input Element found: Checkbox1
    Input Element found: Radio1
    Input Element found: TextBox1

Discussion

There are many ways that HTML can be parsed: regular expressions, straight text parsing, or even third-party products. The MSHTML parser is free and prevalent, but it is COM-based. Being COM-based in a .NET world carries a price: every call goes through the COM interop layer, and data is marshaled back and forth. MSHTML can be made to perform decently in many situations, but it is not a solution for high-end HTML parsing because of the overhead incurred each time the COM interop layer is traversed. This should be considered a client-side operation only. The overhead would quickly degrade performance in a server-side scenario like an HttpHandler or a high-traffic web page.

That being said, if you are looking for a quick way to parse HTML in your application and it is not a potential performance hotspot, this method will do quite nicely. Like many other things in .NET, if you know what you are using it for and the scope of the work it will do, many alternatives become acceptable.

See Also

See the "IHtmlDocument2 Interface" and "MSHTML" topics in the MSDN documentation.