10.5. Characters in Domain Names and URLs

The use of the Web and the Internet in general has become more genuinely global and multilingual than it used to be. This has made it more obvious that we need ways to use non-ASCII characters in URLs (web addresses) as well as in Internet domain names. These two are connected, but they are not the same thing. You could have a domain name like école.example that you wish to use in different contexts, such as email addresses. You could also wish to use a URL like http://école.example/Noël, which has a non-ASCII character not only in the server part (the domain name) but also elsewhere.

Internet domain names, especially those of web servers, have become very important in business. Companies typically advertise their web sites by printing the domain name in their brochures and ads, and it is essential that potential users see the name as natural, understandable, and easy to remember. It is therefore understandable that companies and other organizations did not like the limitation to ASCII. If your company's name contains the word Müller, you don't like the idea of having to spell it as Muller or Mueller.

Unfortunately, internationalization of domain names and URLs is still a work in progress, though progress is being made. Many countries already allow non-ASCII characters in domain names registered under the country domain. This addresses some of the most critical business issues.

10.5.1. Internationalized Domain Names (IDN)

Internationalization of Internet domain names is based on a special ad hoc method. Instead of extending the character repertoire in any general way, which would mean thorough changes to the infrastructure, we interpret some special combinations of ASCII characters as indicating non-ASCII characters. This is in a sense yet another example of escape notations, which we discussed in Chapter 2.

10.5.1.1. The IDNA implementation

The Internationalized Domain Name (IDN) idea uses character combinations containing two consecutive hyphen-minus characters (--) for special purposes. Such a combination is hardly meaningful in itself; a single hyphen-minus may well appear in a normal domain name, but why would anyone use two of them in succession? Since 1998, different proposals have been made and debated, but in 2005, "Internationalizing Domain Names in Applications (IDNA)" was chosen as the way to implement IDN. Its basic definition is in RFC 3490, and it works as follows:
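
As a rough illustration of this mapping, the following sketch uses the "idna" and "punycode" codecs that ship with Python's standard library; the "idna" codec follows RFC 3490. The helper names to_ascii_form and to_unicode_form are made up for this example, and the domain www.härmä.fi is the one discussed in this section.

```python
# A minimal sketch, assuming Python's built-in "idna" codec (RFC 3490).
# The function names are illustrative, not part of any standard API.

def to_ascii_form(domain: str) -> str:
    """Map a domain name with non-ASCII labels to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode_form(domain: str) -> str:
    """Map an ASCII (xn--) domain name back to its Unicode notation."""
    return domain.encode("ascii").decode("idna")


name = "www.h\u00e4rm\u00e4.fi"          # www.härmä.fi
ascii_name = to_ascii_form(name)
print(ascii_name)                        # www.xn--hrm-qlac.fi
print(to_unicode_form(ascii_name) == name)

# Only labels that contain non-ASCII characters get the "xn--" prefix;
# the rest of such a label is the Punycode form of the original label.
print("h\u00e4rm\u00e4".encode("punycode"))   # b'hrm-qlac'
```

Labels that are already pure ASCII, such as "www" and "fi", pass through unchanged, which is why existing domain names are unaffected by the scheme.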
The resulting domain name, www.xn--hrm-qlac.fi, is not meant to be written or seen as such. However, technically, it is an Internet domain name, and it can be used as such. In fact, it is the actual domain name in this case. The string "www.härmä.fi" is just a notation that denotes this name, or maps to it, in software that supports IDNA. Thus, in browsers that support IDNA, you can type either form to access the site, but in other browsers, you need to type the awkward real domain name.

10.5.1.2. Security threats

As we mentioned in Chapter 6, IDNs raise serious security problems. If the full Unicode repertoire were allowed in IDNs, in any mixture, it would be all too easy to mislead people. For example, someone might register a domain name that has an IDN form like www.money.example, where the letter "o" is the Cyrillic small letter "o." Since that letter looks identical to the Latin small letter "o," people would believe they are visiting www.money.example (with Latin "o") and type their username and password there. The cheater could then abuse this information to steal money, for example.

Generally, ease of use tends to imply threats to security, and IDNA is meant to make internationalized domain names easy to use. Further problems are caused by people's tendency to follow links in email messages and on web pages, instead of typing in a web address or picking it from a list of bookmarks (favorites). A large part of the security problem would be avoided if people typed in addresses, or used addresses that they have previously typed in. They would then access the real www.money.example and not the fake one. However, since people's habits are difficult to change, guidelines have been designed to reduce the risks by restricting the variety of characters and combinations in IDNs.

There is a draft Unicode Technical Report #36, "Unicode Security Considerations," which addresses such problems, at http://www.unicode.org/draft/reports/tr36/tr36.html.
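
The look-alike problem described in this section can be made concrete with a short sketch. It guesses the script of each letter from the first word of its Unicode character name, which is a crude heuristic chosen for this example only; it is not the procedure defined in Technical Report #36, and the function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Guess which scripts the letters of a label belong to.

    Heuristic only: uses the first word of each letter's Unicode name
    (e.g. "LATIN", "CYRILLIC"). Not the algorithm of any specification.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_suspicious(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(scripts_in_label(label)) > 1


genuine = "money"
fake = "m\u043eney"   # U+043E CYRILLIC SMALL LETTER O instead of Latin "o"

print(genuine == fake)                  # False, although the two look alike
print(sorted(scripts_in_label(fake)))   # ['CYRILLIC', 'LATIN']
print(looks_suspicious(genuine), looks_suspicious(fake))   # False True
```

A check of this kind would not catch a name written entirely in look-alike letters of a single script, which is one reason restrictions on the allowed character repertoire are also considered.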
The file http://unicode.org/draft/reports/tr36/data/draft-restrictions.txt contains a draft list of characters to be excluded, for one reason or another. The general idea is to allow names of the form that is normally used in a script or language but exclude characters that have no such normal use, such as phonetic symbols and most mathematical symbols. Of course, there are borderline cases.

10.5.2. Characters in URLs

In Chapter 6, we described URL encoding, which was originally introduced as a method for using some ASCII characters that are not allowed as such, e.g., encoding a space character as %20. Later, it was extended to encode octets rather than just ASCII characters. In the modern approach, the implied primary character encoding is UTF-8, and URL encoding then maps the octets used in UTF-8 to %xx notations if needed.

Although the mechanisms in principle let you create URLs with non-ASCII characters anywhere, it will take a long time before they work safely. You still need to use addresses that can be written in ASCII without any special conventions, even if they are not easy or natural for users.

For example, assume that you would really like to use a URL containing the part "skål," such as http://www.example/skål/. Maybe you expect your potential visitors to try to type "skål" simply because they are used to that spelling, even if they have seen the URL printed with "skaal." Here is a possible strategy:
Additional complications arise if you wish to use uppercase characters or to make lowercase and uppercase equivalent. Although servers may have options for treating URLs as case insensitive with regard to basic Latin letters (accepting "foo," "Foo," and "FOO" as equivalent), these options probably do not apply to other letters: Å would still be different from å. Special operations, such as URL rewrite rules, would be needed to make them equivalent.
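
The UTF-8-based URL encoding described in this section, and the case problem just mentioned, can be illustrated with Python's standard urllib.parse module. This is only a sketch: the path "/skål/" comes from the example above, and same_path_ignoring_case is a hypothetical helper that stands in for whatever a server-side rewrite rule would have to do.

```python
from urllib.parse import quote, unquote

# Non-ASCII characters are first encoded as UTF-8; each resulting octet
# is then written in %xx notation.
lower = quote("/sk\u00e5l/")    # path containing å (U+00E5)
upper = quote("/sk\u00c5l/")    # path containing Å (U+00C5)

print(lower)                    # /sk%C3%A5l/
print(upper)                    # /sk%C3%85l/
print(quote("a b"))             # a%20b
print(unquote(lower))           # /skål/


def same_path_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: decode both paths, then compare them with
    full Unicode case folding rather than ASCII-only folding."""
    return unquote(a).casefold() == unquote(b).casefold()


# Folding only the ASCII letters of the encoded form does not help,
# because å and Å are encoded as different octets.
print(lower.lower() == upper.lower())          # False
print(same_path_ignoring_case(lower, upper))   # True
```

The comparison succeeds only after the %xx notations have been decoded back to characters, which is why an option that handles only basic Latin letters leaves Å and å distinct.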