4.1. Anatomy of a URL

Here are a few examples of URLs that illustrate the problem:

  • http://www.craic.com

  • http://208.12.16.5

  • http://%77%77%77%2e%63%72%61%69%63%2e%63%6f%6d

  • http://www.oreilly.com@www.craic.com

All of these take you to my web site, but only the first one is recognizable by the casual user. Most of these variants use the more arcane features of the URL specification, so I will start with a brief review of that. The general syntax of a URL is as follows:

<protocol>://<user>:<password>@<host>:<port>/<url-path>

This can be simplified to produce something that looks almost familiar:

<protocol>://<host>/<url-path>

<protocol>

This notation refers to the network protocol used to transfer data back and forth. This is usually the Hypertext Transfer Protocol (http), but other options include https, ftp, file, and mailto.


<host>

The address of the web server, represented as a fully qualified domain name (FQDN), such as www.craic.com, or a numeric IP address, such as 208.12.16.5.


<url-path>

The path to a specific file or directory on that web server.


<port>

This allows you to specify the TCP/IP port to use in the http transaction. The default port is 80, but you sometimes see other ports specified, such as 8080.


<username> and <password>

These are rarely seen in normal URLs. When you visit a web site that has restricted access, you usually authenticate yourself via a pop-up window. You can enter the same information in the URL if you want to, but this is a very bad idea because your password is left in plain view. In fact, about the only people who use this mechanism are the bad guys trying to con us, as you shall see shortly.

The ultimate reference for this syntax is RFC 1738, "Uniform Resource Locators (URL)," issued by the Network Working Group in 1994 and written by Berners-Lee, Masinter, and McCahill (http://www.w3.org/Addressing/rfc1738.txt).

4.1.1. Encoding Characters in URLs

Alongside the syntax are the encodings that can be applied to the different components. Certain characters such as &, ?, and = have special meanings in a URL string. Including these in the name of a file on a web site could have unwanted consequences when interpreted by a web server in the context of a URL.

By way of illustration, consider the slightly contrived case of an HTML file with the name test?key=value.html. In the form of a URL, it looks like this:

http://www.craic.com/test?key=value.html

The web server takes this string at face value and tries to execute a CGI script called test, setting the parameter key to value.html. The server returns an error because the script test does not exist. You get around this by encoding the special characters in hexadecimal. The web server ignores these when parsing the URL, converting them only when it tries to retrieve the file.

Hexadecimal codes are two-character strings and are denoted within a URL by a preceding % character. You can see the entire hexadecimal character set on a Unix system with the command man ascii. In the previous example, ? is encoded as %3f and = as %3d. When you type this form of the URL into the browser, you see the intended web page.

http://www.craic.com/test%3fkey%3dvalue.html

Any other character in the URL path or hostname can be encoded in hexadecimal. The one you will be most familiar with is the space character, encoded as %20. A number of web browsers will encode spaces automatically if you include them in your URL. Spaces can also be replaced by + characters.

This mechanism is part of the URL specification, so web servers are built to handle them. This feature allows you to encode not just the special characters but essentially entire URLs in hexadecimal and have them function normally. Hence the URL for my web site can be represented as:

http://%77%77%77%2e%63%72%61%69%63%2e%63%6f%6d

Decoding a hexadecimal URL back to ASCII is tedious in the extreme, so Example 4-1 provides a simple Perl script that does the job for you. Example 4-2 allows you to encode your ASCII text as hexadecimal.

Example 4-1. decode_hex_url.pl
#!/usr/bin/perl -w

die "Usage: $0 <hex encoded URL>" unless @ARGV == 1;

$ARGV[0] =~ s/\%(..)/chr hex $1/ge;
print $ARGV[0] . "\n";

Example 4-2. encode_hex_url.pl
#!/usr/bin/perl -w

die "Usage: $0 <ASCII URL>" unless @ARGV == 1;

for (my $i = 0; $i < length $ARGV[0]; $i++) {
    my $c = substr($ARGV[0], $i, 1);
    printf "%%%02x", ord $c;
}
print "\n";

Here is a real example using a hybrid of ASCII and hexadecimal to make you think it is a legitimate URL at a major bank. It's a long URL so I've had to split it into two lines:

http://web.da-us.citibank.com%2E%75%73%65%72%73%65%74%2E%6E
%65%74:%34%39%30%33/%63/%69%6E%64%65%78%2E%68%74%6D

Translated back to ASCII, it reveals that the bank's domain is simply part of the hostname of a totally different server:

http://web.da-us.citibank.com.userset.net:4903/c/index.htm

4.1.2. International Domain Names

Historically, domain names have only been able to include letters from the English alphabet, numbers, and dashes. This has posed a problem for companies in non-English-speaking countries that wanted a domain name matching their brand as written in Arabic, Chinese, and so forth. The workaround is called Internationalized Domain Names (IDN), and it involves encoding non-English characters, represented in Unicode, as basic ASCII strings. This encoding is called punycode. The idea is that the existing machinery of the Internet continues to use the limited character set, while web browsers decode punycode entities into their real representation. For example, the domain bücher.ch, with a single non-ASCII character, would be represented as xn--bcher-kva.ch. It's an ugly syntax, but it would normally be hidden from the user.

There is a lot of interest in IDN at the moment, and most of the major browsers support it. But this new functionality brings with it an opportunity for those who want to impersonate the URLs of other companies. Unicode can represent essentially every character in every language used in the world today, and then some. Many of those codes can be handled by punycode. Among them are look-alikes of standard ASCII characters, which can be used to trick users into thinking they are going to one site when in fact they are taken somewhere quite different. For example, the Unicode character called "Cyrillic Small Letter A" looks exactly like the ASCII lowercase a when displayed in a browser. Such a character is called a homograph, but because it is a non-ASCII character, it can only be represented in an encoded IDN. Eric Johanson and colleagues in The Shmoo Group (http://www.shmoo.com) realized this and published the exploit in order to publicize the problem.

They encoded the string paypal.com in punycode, replacing the first a with the Cyrillic character. This resulted in the string xn--pypal-4ve.com, a new domain that they proceeded to register. Anyone entering http://www.xn--pypal-4ve.com into an IDN-enabled browser will see it translated to http://www.paypal.com, but the returned page comes from the first domain.

This is a very clever exploit with serious implications for the success of IDNs. It has yet to turn up in a real phishing attempt, but it received quite a lot of publicity following its publication. In response, new downloads of the Firefox and Mozilla browsers have IDN support turned off by default. One solution would be to remove browser support for specific encoded homograph characters and to prevent domain names that contain them from being registered. But that will require significant cooperation from domain registrars, which may be difficult to obtain.



Internet Forensics
ISBN: 059610006X
Year: 2003
Pages: 121