10.5. Characters in Domain Names and URLs

The use of the Web and the Internet in general has become more genuinely global and multilingual than it used to be. This has made it more obvious that we need ways to use non-ASCII characters in URLs (web addresses) as well as in Internet domain names. These two are somewhat connected, but not the same thing. You could have a domain name like école.example that you wish to use in different contexts, such as email addresses. You could also wish to use a URL like http://école.example/Noël where you have a non-ASCII character not only in the server part (the domain name) but also elsewhere.

Internet domain names, especially those of web servers, have become very important in business. Companies typically advertise their web sites by printing the domain name in their brochures and ads, and it is essential that potential users see the name as natural, understandable, and easy to remember. It is therefore understandable that companies and other organizations did not like the limitation to ASCII. If your company's name contains the word Müller, you don't like the idea of having to spell it as Muller or Mueller.

Unfortunately, internationalization of domain names and URLs is still a work in progress, although real progress is being made. Many countries already allow non-ASCII characters in domain names registered under their country domain. This addresses some of the most critical business issues.

10.5.1. Internationalized Domain Names (IDN)

Internationalization of Internet domain names is based on a special ad hoc method. Instead of extending the character repertoire in any general way, which would mean thorough changes to the infrastructure, we interpret some special combinations of ASCII characters as indicating non-ASCII characters. This is in a sense yet another example of escape notations, which we discussed in Chapter 2.

10.5.1.1. The IDNA implementation

The Internationalized Domain Name (IDN) idea uses character combinations containing two consecutive hyphen-minus characters (--) for special purposes. Such a combination is hardly meaningful as such; a single hyphen-minus may well appear in a normal domain name, but why would anyone use two of them in succession?

Since 1998, different proposals have been made and debated, but in 2003, "Internationalizing Domain Names in Applications (IDNA)" was chosen as the way to implement IDN. Its basic definition is in RFC 3490, and it works as follows:

Start with a domain name that may contain non-ASCII characters. We will here consider the hypothetical example of "www.härmä.fi."
Divide the name into components separated by periods, and handle each component separately. In our example, the components "www" and "fi" need no further processing, but "härmä" does.
Apply the Nameprep algorithm, defined in RFC 3491 as a profile of the more general Stringprep algorithm. It consists of Unicode normalization to form KC (NFKC), case folding (to lowercase), mapping similar-looking characters together, and eliminating certain restricted code points. In our example, "härmä" is unchanged. In less ordinary cases, the component may change substantially.
Apply Punycode (see Chapter 6) to the result and prefix it with "xn--". In our example, the component "härmä" is changed to "xn--hrm-qlac."

The resulting domain name, www.xn--hrm-qlac.fi, is not meant to be written or seen as such. However, technically, it is an Internet domain name, and it can be used as such. In fact, it is the domain name in this case. The string "www.härmä.fi" is just a notation that denotes this name, or maps to it, on software that supports IDNA. Thus, on browsers that support IDNA, you can type either of the domain names to access the site, but on other browsers, you need to type the awkward real domain name.
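The conversion steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming implementation: the function name `to_ascii_sketch` is hypothetical, and lowercasing followed by NFKC normalization stands in for the full Nameprep profile, which also maps and prohibits certain code points. Python's built-in `idna` codec, which follows the RFC 3490 rules, serves as a cross-check.

```python
import unicodedata


def to_ascii_sketch(domain: str) -> str:
    """Convert a domain name to its ASCII form, label by label (simplified)."""
    result = []
    for label in domain.split("."):
        if label.isascii():
            # Pure ASCII labels such as "www" and "fi" need no processing.
            result.append(label)
            continue
        # Simplified stand-in for Nameprep: lowercase, then normalize to NFKC.
        prepared = unicodedata.normalize("NFKC", label.lower())
        # Punycode-encode the label and add the "xn--" prefix.
        result.append("xn--" + prepared.encode("punycode").decode("ascii"))
    return ".".join(result)


ascii_name = to_ascii_sketch("www.härmä.fi")
print(ascii_name)  # www.xn--hrm-qlac.fi

# Cross-check against the standard library's IDNA codec, in both directions.
print("www.härmä.fi".encode("idna").decode("ascii") == ascii_name)  # True
print(ascii_name.encode("ascii").decode("idna") == "www.härmä.fi")  # True
```

In real code, the built-in codec (or a maintained IDNA library) would be preferable to the hand-written function, which exists here only to make the individual steps visible.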

10.5.1.2. Security threats

As we mentioned in Chapter 6, IDNs raise serious security problems. If the full Unicode repertoire were allowed in IDNs, in any mixture, it would be all too easy to mislead people. For example, someone might register a domain name that has an IDN form like www.money.example, where the letter "o" is the Cyrillic small letter "o." Since that letter is indistinguishable from the Latin small letter "o," people would believe they are visiting www.money.example (with Latin "o") and type their username and password there. The attacker could then abuse this information to steal money, for example.
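The following sketch shows why such a spoofed name is a different string even though it looks the same, and how a mixture of scripts might be flagged. It rests on an assumption: Python's standard library does not expose the Unicode script property, so the first word of each character's name (for example "LATIN" or "CYRILLIC") is used as a rough proxy. The function name `letter_scripts` is hypothetical.

```python
import unicodedata


def letter_scripts(label: str) -> set:
    """Return a rough set of script names for the letters in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # The first word of the character name, e.g. "LATIN" or
            # "CYRILLIC", serves as an approximation of the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "money"
spoofed = "m\u043eney"  # U+043E CYRILLIC SMALL LETTER O replaces Latin "o"

print(genuine == spoofed)                # False
print(sorted(letter_scripts(genuine)))   # ['LATIN']
print(sorted(letter_scripts(spoofed)))   # ['CYRILLIC', 'LATIN']
```

A label whose letters come from more than one script is not necessarily malicious, but it is a reasonable candidate for closer inspection.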

Generally, ease of use tends to imply threats to security, and IDNA is meant to make internationalized domain names easy to use. Further problems are caused by people's tendency to follow links in email messages and on web pages, instead of typing in a web address or picking it up from a list of bookmarks (favorites). A large part of the security problem would be avoided if people typed in addresses, or used addresses that they have previously typed in. They would access the real www.money.example and not the fake.

However, since people's habits are difficult to change, guidelines have been designed to reduce the risks by restricting the variety of characters and combinations in IDNs. There is a draft Unicode Technical Report #36, "Unicode Security Considerations," which addresses such problems, at http://www.unicode.org/draft/reports/tr36/tr36.html. The file http://unicode.org/draft/reports/tr36/data/draft-restrictions.txt contains a draft list of characters to be excluded, for one reason or another. The general idea is to allow names of the form that is normally used in a script or language but exclude characters that have no such normal use, such as phonetic symbols and most mathematical symbols. Of course, there are borderline cases.

10.5.2. Characters in URLs

In Chapter 6, we described URL encoding, which was originally introduced as a method for using some ASCII characters that are not allowed as such, e.g., encoding a space character as %20. Later, it was extended to encode octets rather than just ASCII characters. In the modern approach, the implied primary character encoding is UTF-8, and URL encoding then maps the octets used in UTF-8 to %xx notations if needed.
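As a small illustration of this mechanism, Python's standard `urllib.parse` module percent-encodes strings via UTF-8 by default. The sample strings are only examples. Note that Python writes the hexadecimal digits in uppercase; in %xx notations, uppercase and lowercase digits are equivalent.

```python
from urllib.parse import quote, unquote

# A space is not allowed as such, so it becomes %20.
print(quote("a b"))  # a%20b

# "ë" (U+00EB) is the two octets C3 AB in UTF-8, each written as %xx.
print(quote("Noël"))  # No%C3%ABl

# Decoding reverses the process, again assuming UTF-8.
print(unquote("No%C3%ABl") == "Noël")  # True
```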

Although the mechanisms in principle let you create URLs with non-ASCII characters anywhere, it will take a long time before they work safely. You still need to use addresses that can be written in ASCII without any special conventions, even if they are not easy or natural for users.

For example, assume that you would really like to use a URL containing the part "skål," such as http://www.example/skål/. Maybe you expect your potential visitors to try to type "skål" simply because they are used to that spelling, even if they have seen the URL printed with "skaal." Here is a possible strategy:

First and foremost, make sure that the address with a simplified spelling ("a" instead of å) works: http://www.example/skal/.
Consider other ways that people might try to type the name if they just heard it or try to recollect it. If you know that å is often written as "aa" when only ASCII is available, you might set things up (on the server) so that http://www.example/skaal/ works too, as an alias for the same page.
You might also set things up so that http://www.example/sk%e5l/ works, as an alias, because when people type "skål" into the address box of a browser, the browser may URL encode the string according to ISO-8859-1, mapping å (U+00E5) to "%e5."
Then you could make the server recognize http://www.example/sk%c3%a5l/ as well. This is how http://www.example/skål/ should be URL encoded by modern principles: take the URL string, encode it as UTF-8, making å the two octets C3 and A5, and then encode these octets as "%c3" and "%a5."
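The strategy above can be sketched as a small alias resolver. Everything named here is an assumption made for illustration: the choice of `/skal/` as the canonical path, the `ALIASES` table, and the functions `decode_path` and `resolve` are hypothetical, and a real server would usually express the same idea in its own configuration or rewrite rules. The sketch decodes the %xx notations to octets, tries UTF-8 first, and falls back to ISO-8859-1 when the octets are not valid UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical alias table: every accepted spelling leads to one page.
CANONICAL = "/skal/"
ALIASES = {"/skal/", "/skaal/", "/skål/"}


def decode_path(raw_path: str) -> str:
    """Percent-decode a path, trying UTF-8 first, then ISO-8859-1."""
    octets = unquote_to_bytes(raw_path)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the browser used ISO-8859-1.
        return octets.decode("iso-8859-1")


def resolve(raw_path: str):
    """Return the canonical path for a known alias, or None."""
    return CANONICAL if decode_path(raw_path) in ALIASES else None


# The two encodings of "skål" discussed above:
print(quote("skål"))                         # sk%C3%A5l
print(quote("skål", encoding="iso-8859-1"))  # sk%E5l

for path in ["/skal/", "/skaal/", "/sk%e5l/", "/sk%c3%a5l/", "/other/"]:
    print(path, "->", resolve(path))
```

The fallback is a heuristic: a sequence of ISO-8859-1 octets that happens to be valid UTF-8 would be read as UTF-8. For short names like this one, that is rarely a practical problem.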

Additional complications arise if you wish to use uppercase characters or to make lowercase and uppercase equivalent. Although servers may have options for treating URLs as case insensitive with regard to basic Latin letters (accepting "foo," "Foo," and "FOO" as equivalent), these options probably do not apply to other letters: Å would still be different from å. Special operations, such as URL rewrite rules, would be needed to make them equivalent.
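One way to think about such a rule is as a normalization applied to the path before lookup. The sketch below is an assumption-laden illustration rather than a server feature: the function name `fold_path` is hypothetical, and it assumes the path is URL encoded via UTF-8. It composes characters to normalization form C and then applies Unicode case folding, so that Å and å compare as equal.

```python
import unicodedata
from urllib.parse import unquote


def fold_path(raw_path: str) -> str:
    """Decode a path and fold case for all letters, not just A to Z."""
    decoded = unquote(raw_path)  # assumes UTF-8 percent-encoding
    # Compose first, so that "A" + combining ring becomes a single "Å",
    # then fold case so that uppercase and lowercase forms match.
    return unicodedata.normalize("NFC", decoded).casefold()


print(fold_path("/SK%C3%85L/") == fold_path("/sk%c3%a5l/"))  # True
print(fold_path("/skA\u030al/") == fold_path("/skål/"))      # True
```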