Recipe 10.14. Using Built-in Regular Expressions to Parse ASP.NET Pages

Problem

You need to build a tool that parses ASP.NET pages to extract specific bits of information. Such a tool could be used to detect whether specific meta tags are present or whether any comments expose information useful to a hacker.

Solution

Use the classes in the System.Web.RegularExpressions namespace. This recipe focuses on mapping out the start and end tags on a page, as shown in Example 10-11.

Example 10-11. Parsing a web page
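
The listing for this example is not reproduced here. As a rough guide only, the following is a minimal sketch of what an ASPNETStartEndTagParsing method could look like, modeled on the ASPNETStartTagParsing method in the Discussion. It assumes the .NET Framework with a reference to System.Web.RegularExpressions.dll; the wrapping class name AspNetParsingSketch is hypothetical, and the body is a reconstruction rather than the original listing.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Web.RegularExpressions; // .NET Framework only (assumed reference: System.Web.RegularExpressions.dll)

// Hypothetical wrapper class; only the method name comes from the recipe.
public static class AspNetParsingSketch
{
    public static void ASPNETStartEndTagParsing(string html)
    {
        // Each class wraps one fixed pattern, so a single instance can be reused.
        TagRegex startTag = new TagRegex();
        EndTagRegex endTag = new EndTagRegex();

        int index = 0;
        while (index < html.Length)
        {
            // Try to match a start tag beginning exactly at the current position.
            Match m = startTag.Match(html, index);
            if (m.Success)
            {
                index = m.Index + m.Length; // skip past the matched tag
                Console.WriteLine("ASP.NET Start Tag");
                Console.WriteLine(m.Value);
                continue;
            }

            // Otherwise try to match an end tag at the same position.
            m = endTag.Match(html, index);
            if (m.Success)
            {
                index = m.Index + m.Length;
                Console.WriteLine("ASP.NET End Tag");
                Console.WriteLine(m.Value);
                continue;
            }

            // Neither pattern matched here; advance one character and retry.
            index++;
        }
    }
}
```

The loop steps through the string one character at a time because these patterns are anchored with \G, so Match(html, index) succeeds only when a tag starts exactly at index.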
When the ASPNETStartEndTagParsing method is called in the following manner:

    public static void TestASPNETParsing()
    {
        string testHTML = "<%-- Comment --%> <%@ Page Language=\"CS\" " +
            "AutoEventWireup=\"false\" CodeFile=\"Default.aspx.cs\" " +
            "Inherits=\"Default_aspx\" %>" +
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" " +
            "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"> " +
            "<head runat=\"server\"> " +
            "<title>Untitled Page</title> " +
            "</head><body><form id=\"form1\" runat=\"server\"><div> " +
            "<asp:Login ID=\"Login1\" runat=\"server\"></asp:Login>" +
            "</div></form></body></html>";

        ASPNETStartEndTagParsing(testHTML);
    }

The following is displayed:

    ASP.NET Start Tag
    <html xmlns="http://www.w3.org/1999/xhtml">
    ASP.NET Start Tag
    <head runat="server">
    ASP.NET Start Tag
    <title>
    ASP.NET End Tag
    </title>
    ASP.NET End Tag
    </head>
    ASP.NET Start Tag
    <body>
    ASP.NET Start Tag
    <form runat="server">
    ASP.NET Start Tag
    <div>
    ASP.NET Start Tag
    <asp:Login runat="server">
    ASP.NET End Tag
    </asp:Login>
    ASP.NET End Tag
    </div>
    ASP.NET End Tag
    </form>
    ASP.NET End Tag
    </body>
    ASP.NET End Tag
    </html>

Discussion

There are 15 classes within the System.Web.RegularExpressions namespace that let you parse existing .aspx pages, HTML pages, and other types of web pages. You can even parse XML to some extent. Each of these classes inherits from the System.Text.RegularExpressions.Regex class. What makes these classes unique is that each contains a single regular expression tailored to one aspect of a web page. For example, the CommentRegex class contains the following regular expression:

    \G<%--(([^-]*)-)*?-%>

which looks for a comment within a web page in the following format:

    <%-- this is a comment --%>

Table 10-1 lists each class and its associated regular expression along with its description.

You will notice that some of these classes are designed to operate on the matches of another class. For example, the RunatServerRegex class can determine whether a particular tag is written to be executed on the server. The following code displays all start tags and indicates which of them are written to be executed on the server:

    using System;
    using System.Text.RegularExpressions;
    using System.Web.RegularExpressions;

    public static void ASPNETStartTagParsing(string html)
    {
        // Each class wraps one fixed pattern, so create the instances once.
        TagRegex aspTag = new TagRegex();
        RunatServerRegex aspRunAt = new RunatServerRegex();

        int index = 0;
        while (index < html.Length)
        {
            // Display the start tag and whether it contains a
            // runat="server" attribute.
            Match m = aspTag.Match(html, index);
            if (m.Success)
            {
                index = m.Index + m.Length;

                Console.WriteLine("ASP.NET Start Tag");
                Console.WriteLine(m.Value);

                // Run the second regex against the text of the matched tag only.
                Match mInner = aspRunAt.Match(m.Value, 0);
                if (mInner.Success)
                {
                    Console.WriteLine("\tASP.NET RunAt");
                    Console.WriteLine("\t" + mInner.Value);
                }

                continue;
            }

            // No start tag begins at this position; move to the next character.
            index++;
        }
    }

Nesting these ASP.NET parsing classes in this manner will allow you to tear apart a web page quite easily.
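
The same scanning loop can be pointed at the other use case named in the Problem section: finding comments that might expose information. The following is a minimal sketch, assuming the .NET Framework with a reference to System.Web.RegularExpressions.dll. The class name AspNetCommentScanner and the method name FindServerComments are hypothetical and are not part of the recipe or the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Web.RegularExpressions; // .NET Framework only (assumed reference: System.Web.RegularExpressions.dll)

// Hypothetical helper; the names below are illustrative only.
public static class AspNetCommentScanner
{
    // Returns every server-side comment (<%-- ... --%>) found in the page text.
    public static List<string> FindServerComments(string page)
    {
        List<string> comments = new List<string>();
        CommentRegex commentRegex = new CommentRegex();

        int index = 0;
        while (index < page.Length)
        {
            // The pattern is anchored with \G, so this succeeds only when a
            // comment starts exactly at the current position.
            Match m = commentRegex.Match(page, index);
            if (m.Success)
            {
                comments.Add(m.Value);
                index = m.Index + m.Length; // resume scanning after the comment
            }
            else
            {
                index++; // no comment starts here; try the next character
            }
        }

        return comments;
    }
}
```

Called on the test string from TestASPNETParsing, this should return a single entry, <%-- Comment --%>. Returning a list instead of writing to the console is a design choice made here so that a caller can inspect or filter the comments, for example by searching them for server names or passwords.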