Recipe 10.14. Using Built-in Regular Expressions to Parse ASP.NET Pages

Problem

You need to build a tool that parses ASP.NET pages in order to extract specific bits of information. This tool could possibly be used to detect whether specific meta tags are being used or if there are any comments that could expose information useful to a hacker.

Solution

Use the classes in the System.Web.RegularExpressions namespace. In this recipe you will focus on mapping out the start and end tags on a page, as shown in Example 10-11.

To use any of the classes in the System.Web.RegularExpressions namespace, you need to manually add a reference to the System.Web.RegularExpressions.dll assembly in your project.


Example 10-11. Parsing a web page

 // Requires: using System; using System.Text.RegularExpressions;
 //           using System.Web.RegularExpressions;
 public static void ASPNETStartEndTagParsing(string html)
 {
     // Create each regex once instead of on every loop iteration.
     TagRegex aspTag = new TagRegex();
     EndTagRegex aspEndTag = new EndTagRegex();

     int index = 0;
     while (index < html.Length)
     {
         // Display the start tag.
         Match m = aspTag.Match(html, index);
         if (m.Success)
         {
             index = m.Index + m.Length;
             Console.WriteLine("ASP.NET Start Tag");
             Console.WriteLine(m.Value);
             continue;
         }

         // Display the end tag.
         m = aspEndTag.Match(html, index);
         if (m.Success)
         {
             index = m.Index + m.Length;
             Console.WriteLine("ASP.NET End Tag");
             Console.WriteLine(m.Value);
             continue;
         }

         // Neither kind of tag starts here; move to the next character.
         index++;
     }
 }

When the ASPNETStartEndTagParsing method is called in the following manner:

 public static void TestASPNETParsing()
 {
     string testHTML = "<%-- Comment --%> <%@ Page Language=\"CS\" " +
                       "AutoEventWireup=\"false\" CodeFile=\"Default.aspx.cs\" " +
                       "Inherits=\"Default_aspx\" %>" +
                       "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" " +
                       "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" +
                       "<html xmlns=\"http://www.w3.org/1999/xhtml\"> " +
                       "<head runat=\"server\"> " +
                       "<title>Untitled Page</title> " +
                       "</head><body><form id=\"form1\" runat=\"server\"><div> " +
                       "<asp:Login ID=\"Login1\" runat=\"server\"></asp:Login>" +
                       "</div></form></body></html>";

     ASPNETStartEndTagParsing(testHTML);
 }

The following is displayed:

 ASP.NET Start Tag
 <html xmlns="http://www.w3.org/1999/xhtml">
 ASP.NET Start Tag
 <head runat="server">
 ASP.NET Start Tag
 <title>
 ASP.NET End Tag
 </title>
 ASP.NET End Tag
 </head>
 ASP.NET Start Tag
 <body>
 ASP.NET Start Tag
 <form id="form1" runat="server">
 ASP.NET Start Tag
 <div>
 ASP.NET Start Tag
 <asp:Login ID="Login1" runat="server">
 ASP.NET End Tag
 </asp:Login>
 ASP.NET End Tag
 </div>
 ASP.NET End Tag
 </form>
 ASP.NET End Tag
 </body>
 ASP.NET End Tag
 </html>

Discussion

There are 15 classes within the System.Web.RegularExpressions namespace that give you the ability to parse existing aspx pages and HTML pages as well as other types of web pages. You can even parse XML to some extent. Each of these classes inherits from the System.Text.RegularExpressions.Regex class. What makes these classes unique is that each contains a single regular expression that parses one particular aspect of a web page. For example, the CommentRegex class contains the following regular expression:

 \G<%--(([^-]*)-)*?-%> 

which looks for a comment within a web page in the following format:

 <%-- this is a comment --%> 
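As a rough illustration of what this pattern matches, the sketch below collects every server-side comment on a page, which is one way to look for comments that might leak information. It assumes a plain Regex built from the pattern above as a stand-in for CommentRegex, so it does not need the System.Web.RegularExpressions assembly; the CommentSketch class and FindComments method are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentSketch
{
    // Pattern copied from the text above. Assumption: a plain Regex with
    // default options stands in for the CommentRegex class.
    private static readonly Regex Comment =
        new Regex(@"\G<%--(([^-]*)-)*?-%>");

    // Returns every <%-- ... --%> comment found in the page, in order.
    public static List<string> FindComments(string page)
    {
        var comments = new List<string>();
        int index = 0;
        while (index < page.Length)
        {
            // \G anchors the attempt at 'index', so each call only tests
            // whether a comment starts at exactly this position.
            Match m = Comment.Match(page, index);
            if (m.Success)
            {
                comments.Add(m.Value);
                index = m.Index + m.Length;
            }
            else
            {
                index++;
            }
        }
        return comments;
    }
}
```

Calling CommentSketch.FindComments on a page's markup returns the full text of each comment, which can then be reviewed or searched for sensitive keywords.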

Table 10-1 lists each class and its associated regular expression along with its description.

Table 10-1. Descriptions of the System.Web.RegularExpressions classes

Class name: AspCodeRegex
Regular expression: \G<%(?!@)(?<code>.*?)%>
Description: Parses a code block of the form <% code %>.

Class name: AspExprRegex
Regular expression: \G<%\s*?=(?<code>.*?)?%>
Description: Parses an expression block of the form <%=expression %>.

Class name: CommentRegex
Regular expression: \G<%--(([^-]*)-)*?-%>
Description: Parses a comment of the form <%-- comment --%>.

Class name: DatabindExprRegex
Regular expression: \G<%#(?<code>.*?)?%>
Description: Parses a data binding expression of the form <%# expressions %>.

Class name: DataBindRegex
Regular expression: \G\s*<%\s*?#(?<code>.*?)?%>\s*\z
Description: Parses a data binding of the form <%#expressions %>.

Class name: DirectiveRegex
Regular expression: \G<%\s*@(\s*(?<attrname>\w[\w:]*(?=\W))(\s*(?<equal>=)\s*"(?<attrval>[^"]*)"|\s*(?<equal>=)\s*'(?<attrval>[^']*)'|\s*(?<equal>=)\s*(?<attrval>[^\s%>]*)|(?<equal>)(?<attrval>\s*?)))*\s*?%>
Description: Parses a directive of the form <%@directive %>.

Class name: EndTagRegex
Regular expression: \G</(?<tagname>[\w:\.]+)\s*>
Description: Parses an end tag of the form </tagname>.

Class name: GTRegex
Regular expression: [^%]>
Description: Parses a greater-than character that is not part of a tag.

Class name: IncludeRegex
Regular expression: \G<!--\s*#(?i:include)\s*(?<pathtype>[\w]+)\s*=\s*["']?(?<filename>[^\"']*?)["']?\s*-->
Description: Parses an #include directive of the form <!-- #include file="filename" -->.

Class name: LTRegex
Regular expression: <[^%]
Description: Parses a less-than character that is not part of a tag.

Class name: RunatServerRegex
Regular expression: runat\W*server
Description: Parses the runat attribute of the form runat="server".

Class name: ServerTagsRegex
Regular expression: <%(?![#$])(([^%]*)%)*?>
Description: Parses server tags of the form <% data %>.

Class name: SimpleDirectiveRegex
Regular expression: <%\s*@(\s*(?<attrname>\w[\w:]*(?=\W))(\s*(?<equal>=)\s*"(?<attrval>[^"]*)"|\s*(?<equal>=)\s*'(?<attrval>[^']*)'|\s*(?<equal>=)\s*(?<attrval>[^\s%>]*)|(?<equal>)(?<attrval>\s*?)))*\s*?%>
Description: Parses a directive of the form <%@directive %>. Note that the only difference between this regex and the one used by DirectiveRegex is the lack of the \G anchor, which forces a match to start exactly where the previous match ended (or at the starting index passed to Match).

Class name: TagRegex
Regular expression: \G<(?<tagname>[\w:\.]+)(\s+(?<attrname>\w[-\w:]*)(\s*=\s*"(?<attrval>[^"]*)"|\s*=\s*'(?<attrval>[^']*)'|\s*=\s*(?<attrval><%#.*?%>)|\s*=\s*(?!'|")(?<attrval>[^\s=/>]*)(?!'|")|(?<attrval>\s*?)))*\s*(?<empty>/)?>
Description: Parses a beginning tag of the form <tagname> or <asp:tagname>, including any attributes and their values.

Class name: TextRegex
Regular expression: \G[^<]+
Description: Can be used to parse the text between two tags. Use TagRegex to find the end of a beginning tag and then use this class to find any text between it and the next tag.
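The TextRegex entry suggests pairing it with TagRegex. The following is a minimal sketch of that pairing, assuming plain Regex objects built from the two patterns in Table 10-1 rather than the real classes (whose regex options may differ); TagTextSketch and TextAfterStartTags are hypothetical names introduced for illustration. It records the text that immediately follows each start tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagTextSketch
{
    // TagRegex pattern from Table 10-1, split across lines for readability.
    // Assumption: a plain Regex with default options stands in for TagRegex.
    private static readonly Regex StartTag = new Regex(
        @"\G<(?<tagname>[\w:\.]+)(\s+(?<attrname>\w[-\w:]*)" +
        @"(\s*=\s*""(?<attrval>[^""]*)""|\s*=\s*'(?<attrval>[^']*)'" +
        @"|\s*=\s*(?<attrval><%#.*?%>)" +
        @"|\s*=\s*(?!'|"")(?<attrval>[^\s=/>]*)(?!'|"")" +
        @"|(?<attrval>\s*?)))*\s*(?<empty>/)?>");

    // TextRegex pattern from Table 10-1.
    private static readonly Regex Text = new Regex(@"\G[^<]+");

    // Returns "tagname=text" for each start tag directly followed by
    // non-whitespace text.
    public static List<string> TextAfterStartTags(string html)
    {
        var results = new List<string>();
        int index = 0;
        while (index < html.Length)
        {
            Match tag = StartTag.Match(html, index);
            if (!tag.Success)
            {
                index++;
                continue;
            }

            // Continue right after the start tag; \G makes the text
            // pattern match only at this exact position.
            index = tag.Index + tag.Length;
            Match text = Text.Match(html, index);
            if (text.Success)
            {
                string trimmed = text.Value.Trim();
                if (trimmed.Length > 0)
                {
                    results.Add(tag.Groups["tagname"].Value + "=" + trimmed);
                }
                index = text.Index + text.Length;
            }
        }
        return results;
    }
}
```

Because both patterns begin with \G, each Match call either succeeds at the given index or fails immediately, which is what lets the two regexes hand off to each other without skipping over markup.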


You will notice that some of these classes are designed to operate on the matches of another class. For example, the RunatServerRegex class can determine whether a particular tag is written to be executed on the server. The following code displays all start tags and whether or not they are written to be executed on the server:

 public static void ASPNETStartTagParsing(string html)
 {
     // Create each regex once instead of on every loop iteration.
     TagRegex aspTag = new TagRegex();
     RunatServerRegex aspRunAt = new RunatServerRegex();

     int index = 0;
     while (index < html.Length)
     {
         // Display the start tag and whether it contains a runat="server" attribute.
         Match m = aspTag.Match(html, index);
         if (m.Success)
         {
             index = m.Index + m.Length;
             Console.WriteLine("ASP.NET Start Tag");
             Console.WriteLine(m.Value);

             // Search only within the text of the matched start tag.
             Match mInner = aspRunAt.Match(m.Value, 0);
             if (mInner.Success)
             {
                 Console.WriteLine("\tASP.NET RunAt");
                 Console.WriteLine("\t" + mInner.Value);
             }

             continue;
         }

         index++;
     }
 }

Nesting these ASP.NET parsing classes in this manner will allow you to tear apart a web page quite easily.
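The same nesting idea can be applied to other pairs of patterns. As a hedged sketch, the code below first finds server blocks with the ServerTagsRegex pattern and then applies the AspExprRegex pattern to each block to pick out only the <%= ... %> expressions. It assumes plain Regex objects built from the Table 10-1 patterns; the Singleline option and the ServerBlockSketch and FindExpressions names are assumptions made for this example, not taken from the namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ServerBlockSketch
{
    // ServerTagsRegex pattern from Table 10-1. It has no \G anchor,
    // so Matches can scan the whole page in one call.
    private static readonly Regex ServerTags =
        new Regex(@"<%(?![#$])(([^%]*)%)*?>");

    // AspExprRegex pattern from Table 10-1. Assumption: Singleline is
    // used here so that '.' also matches newlines inside a block.
    private static readonly Regex AspExpr =
        new Regex(@"\G<%\s*?=(?<code>.*?)?%>", RegexOptions.Singleline);

    // Returns the trimmed expression text of every <%= ... %> block.
    public static List<string> FindExpressions(string page)
    {
        var expressions = new List<string>();
        foreach (Match block in ServerTags.Matches(page))
        {
            // Nest the second regex on the text of the first match.
            Match expr = AspExpr.Match(block.Value);
            if (expr.Success)
            {
                expressions.Add(expr.Groups["code"].Value.Trim());
            }
        }
        return expressions;
    }
}
```

Directives and plain code blocks also match the outer pattern, but they fail the inner one because no equals sign follows the opening <%, so only expression blocks are returned.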



C# Cookbook