URL Abbreviation with mod_rewrite

URL abbreviation is one of the most effective techniques you can use to optimize your HTML. First seen on Yahoo!'s home page, URL abbreviation substitutes short redirect URLs (like r/ci) for longer ones (like Computers_and_Internet/) to save space. The Apache and IIS web servers, as well as Manila (http://www.userland.com) and Zope (http://www.zope.org), all support this technique. In Apache, the mod_rewrite module transparently handles URL expansion. For IIS, ISAPI filters handle URL rewrites. Here are some IIS rewriting filters:

  • ISAPI_Rewrite (http://www.isapirewrite.com/)

  • OpCode's OpURL (http://www.opcode.co.uk/components/rewrite.asp)

  • Qwerksoft's IISRewrite based on mod_rewrite (http://www.qwerksoft.com/products/iisrewrite/)

Cocoon

Many of these fancy browser detection and performance techniques can be handled elegantly with Cocoon. Cocoon is an open source web-application platform based on XML and XSLT. Actually a big Java servlet, Cocoon runs on most servlet engines. Cocoon can automatically transform XML into (X)HTML and other formats using XSLT. You can set up a style sheet and an output format for certain types of browsers (WAP, DOM, etc.), and Cocoon does the rest. It makes the difficult task of separating content, layout, and logic trivial. Cocoon can handle the following:

  • Server-side programming

  • URL rewriting

  • Browser detection

  • PDF, legacy file formats, and image generation

  • Server-side compression

You can read more about Cocoon at http://xml.apache.org/cocoon/.

URL abbreviation is especially effective for home or index pages, which typically have a lot of links. As you will discover in Chapter 19, "Case Studies: Yahoo.com and WebReference.com," URL abbreviation can save anywhere from 20 to 30 percent of your HTML file size. The more links you have, the more you'll save.

NOTE: As with most of these techniques, there's always a tradeoff. Using abbreviated URLs can lower search engine relevance, although you can alleviate this somewhat with clever expansions with mod_rewrite.

The popular Apache web server [9] has an optional module, mod_rewrite, that enables your server to automatically rewrite URLs. [10] Created by Ralf Engelschall, this versatile module has been called the "Swiss Army knife of URL manipulation." [11] mod_rewrite can handle everything from URL layout and load balancing to access restriction. We'll be using only a small portion of this module's power by substituting expanded URLs with regular expressions.

[9] Netcraft, Ltd, "Netcraft Web Server Survey" [online], (Bath, UK: Netcraft, Ltd., October 2002 [cited 13 November 2002]), available from the Internet at http://www.netcraft.com/survey/. For active sites, over 65 percent use Apache.
[10] Ralf S. Engelschall, "Apache 1.3 URL Rewriting Guide" [online], (Forest Hill, MD: Apache Software Foundation, 1997 [cited 13 November 2002]), available from the Internet at http://httpd.apache.org/docs/misc/rewriteguide.html. Ralf Engelschall is the author of mod_rewrite, the module used for URL rewriting. See also http://www.engelschall.com/pw/apache/rewriteguide/.
[11] Apache Software Foundation, "Apache Module mod_rewrite" [online], (Forest Hill, MD: Apache Software Foundation, 2001 [cited 13 November 2002]), available from the Internet at http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html. An URL rewriting engine.

The module first examines each requested URL. If it matches one of the patterns you specify, the URL is rewritten according to the rule conditions you set. Essentially, mod_rewrite replaces one URL with another, allowing abbreviations and redirects.

This URL rewriting machine manipulates URLs based on various tests including environment variables, time stamps, and even database lookups. You can have up to 50 global rewrite rules without any discernible effect on server performance. [12] Abbreviated URI expansion requires only one.

[12] Ralf S. Engelschall and Christian Reiber, "URL Manipulation with Apache" Heise Online [online], (Hanover, Germany: Heise Zeitschriften Verlag GmbH, 2001), available from the Internet at http://www.heise.de/ix/artikel/E/1996/12/149/. English translation.

Tuning mod_rewrite

To install mod_rewrite on your Apache Web server, you or your IT department needs to edit one of your server configuration files. The best way to run mod_rewrite is through the httpd.conf file, as this is accessed once per server restart. Without configuration file access, you'll have to use .htaccess for each directory. Keep in mind that the same mod_include performance caveats apply; .htaccess files are slower as each directory must be traversed to read each .htaccess file for each requested URL.

The Abbreviation Challenge

The strategy is to abbreviate the longest, most frequently accessed URLs with the shortest abbreviations. Most webmasters choose one, two, or three-letter abbreviations for directories. On WebReference.com, the goal was to create a mod_rewrite rule that would expand URLs like this:

 r/d 

Into this:

 dhtml/ 

As on Yahoo!, r is the flag we've chosen for redirects. But why stop there? We can extend this concept into more directories. So turn this:

 r/dc 

Into this:

 dhtml/column 

and so on. Note that the lack of a trailing forward slash in this second example allows us to intelligently append column numbers.

With the right RewriteRule, the abbreviation /r/dc/66/ expands into the string /dhtml/column66/.

The RewriteRule Solution

To accomplish this expansion, you need to write a RewriteRule regular expression. First, you need to find the URI pattern /r/d, and then extract /d and turn it into /dhtml. Next, append a trailing slash.

Apache requires two directives to turn on and configure the mod_rewrite module: the RewriteEngine Boolean and the RewriteRule directive. The RewriteRule is a regular expression that transforms one URI into another. The syntax is shown here:

 RewriteRule <pattern> <rewrite as> 

So to create a RewriteRule to solve this problem, you need to add the following two mod_rewrite directives to your server configuration file (httpd.conf or .htaccess):

 RewriteEngine    On
 RewriteRule      ^/r/d(.*)    /dhtml$1

This regular expression matches a URL that begins with /r/ (the ^ character at the beginning means to match from the beginning of the string). Following that pattern is d(.*), which matches the letter d and then zero or more characters after it. Whatever (.*) captures is appended to /dhtml. Note that /r/dodo would therefore expand to /dhtmlodo, so you'll have to make sure anything after r/d always begins with a /.

So when a request comes in for the URI <a href="/r/d/diner/">DHTML Diner</a> , this rule expands this abbreviated URI into <a href="/dhtml/diner/">DHTML Diner</a> .
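The behavior just described can be checked outside the server. The following is a minimal Python sketch, not Apache itself: it assumes the substitution is /dhtml followed by the captured text, and the function name expand is ours.

```python
import re

# Python stand-in for the rule: pattern ^/r/d(.*) with the captured
# text appended to /dhtml. This only imitates the matching logic.
PATTERN = re.compile(r"^/r/d(.*)")

def expand(path):
    """Return the rewritten path, or the path unchanged if the rule does not match."""
    match = PATTERN.match(path)
    if match is None:
        return path
    return "/dhtml" + match.group(1)

print(expand("/r/d/diner/"))  # /dhtml/diner/
print(expand("/r/dodo"))      # /dhtmlodo (the pitfall noted above)
print(expand("/graphics/"))   # /graphics/ (no match, left alone)
```

Running a handful of real links from your pages through a check like this before touching the server configuration is a cheap way to catch patterns that match more than you intended.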

The RewriteMap Solution for Multiple Abbreviations

The RewriteRule solution would work well for a few abbreviations, but what if you want to abbreviate a large number of links? That's where the RewriteMap directive comes in. This feature allows you to group multiple lookup keys (abbreviations) and their corresponding expanded values into one tab-delimited file. Here's an example map file (/www/misc/redir/abbr_webref.txt):

 d    dhtml/
 dc   dhtml/column
 pg   programming/
 h    html/
 ht   html/tools/

The MapName specifies a mapping function between keys and values for a rewriting rule using the following syntax:

 ${MapName:LookupKey|DefaultValue}

When you are using a mapping construct, you generalize the RewriteRule regular expression. Instead of a hard-coded value, the MapName is consulted, and the LookupKey accessed. If there is a key match, the mapping function substitutes the expanded value into the regular expression. If there is no match, the rule substitutes the default value or a blank string.
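To make the lookup-with-default behavior concrete, here is a small Python sketch that reads a map in the same key/value layout and falls back to a default when the key is missing. The names parse_map and lookup are ours, and the parsing rules (whitespace-separated fields, # comments skipped) are assumptions for illustration rather than a description of mod_rewrite's internals.

```python
# Sample map in the layout shown above: key, whitespace, expanded value.
MAP_TEXT = (
    "d\tdhtml/\n"
    "dc\tdhtml/column\n"
    "pg\tprogramming/\n"
    "h\thtml/\n"
    "ht\thtml/tools/\n"
)

def parse_map(text):
    """Build a dict from 'key value' lines, skipping blanks and # comments."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        table[fields[0]] = fields[1]
    return table

def lookup(table, key, default=""):
    """Return the expansion for key, or the default (blank string) if absent."""
    return table.get(key, default)

table = parse_map(MAP_TEXT)
print(lookup(table, "pg"))                # programming/
print(repr(lookup(table, "zz")))          # '' (no match, no default given)
print(lookup(table, "zz", "index.html"))  # index.html
```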

To use this external map file, we'll add the RewriteMap directive and tweak the regular expression correspondingly. The following httpd.conf commands turn rewriting on, show where to look for your rewrite map, and show the definition of the RewriteRule :

 RewriteEngine    On
 RewriteMap       abbr    txt:/www/misc/redir/abbr_webref.txt
 RewriteRule      ^/r/([^/]*)/?(.*)    ${abbr:$1}$2    [redirect=permanent,last]

The first directive turns on rewrites as before. The second points the rewrite module to the text version of our map file. The third tells the processor to look up the value of the matching expression in the map file. Note that the RewriteRule has a permanent redirect (301 instead of 302) and last flags appended to it. Once an abbreviation is found for this URL, no further rewrite rules are processed for it, which speeds up lookups.

Here we've set the rewrite MapName to abbr and the map file location (text format) to the following:

 /www/misc/redir/abbr_webref.txt 

The RewriteRule processes requested URLs using the regular expression:

 ^/r/([^/]*)/?(.*)    ${abbr:$1}$2

This regular expression matches a URL that begins with /r/. (The ^ character at the beginning means to match from the beginning of the string.) Then the regular expression ([^/]*) matches as many non-slash characters as it can, stopping at the next slash or at the end of the string. This effectively pulls out the first string between two slashes following the /r. For example, in the URL /r/pg/javascript/, this portion of the regular expression matches pg. It also will match ht in /r/ht. (Because there are no slashes following, it just continues until it reaches the end of the URL.)

The rest of the pattern /?(.*) matches 0 or 1 forward slashes / with any characters that follow. These two parenthesized expressions will be used in the replacement pattern.
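The two parenthesized groups are easy to inspect with any regular expression engine. This Python sketch applies the same pattern to a few sample paths and prints what each group captures; the helper name split_abbreviation is ours, and Python's re module stands in for Apache's regex engine.

```python
import re

# Same pattern as the RewriteRule: key up to the first slash, then the rest.
RULE = re.compile(r"^/r/([^/]*)/?(.*)")

def split_abbreviation(path):
    """Return (key, remainder) for an /r/ URL, or None if the pattern does not match."""
    match = RULE.match(path)
    if match is None:
        return None
    return (match.group(1), match.group(2))

for path in ("/r/pg/javascript/", "/r/ht", "/r/dc/66/", "/dhtml/"):
    print(path, "->", split_abbreviation(path))
# /r/pg/javascript/ -> ('pg', 'javascript/')
# /r/ht -> ('ht', '')
# /r/dc/66/ -> ('dc', '66/')
# /dhtml/ -> None
```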

The Replacement Pattern

The substitution (${abbr:$1}$2) is the replacement pattern that will be used in the building of the new URL. The $1 and $2 variables refer back (backreferences) to the first and second patterns found in the supplied URL. They represent the first set of parentheses and the second set of parentheses in the regular expression, respectively. Thus for /r/pg/javascript/, $1 = "pg" and $2 = "javascript/". Replacing these in the example produces the following:

 ${abbr:pg}javascript/ 

The ${abbr:pg} is a mapping directive that says, "Refer to the map abbr (recall our map command, RewriteMap abbr txt:/www/misc/redir/abbr_webref.txt), look up the key pg, and return the corresponding data value for that key." In this case, that value is programming/. Thus the abbreviated URL, /r/pg/javascript/, is replaced by the following:

 /programming/javascript/ 

Voila! So you've effectively created an abbreviation expander using a regular expression and a mapping file. Using the preceding rewrite map file, the following URL expansions would occur:

 "r/dc" becomes "dhtml/column"
 "r/pg" becomes "programming/"

The server, upon seeing a matching abbreviation in the map file, will automatically rewrite the URL to the longer value.
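Putting the pattern, the map, and the substitution ${abbr:$1}$2 together, the whole expansion can be imitated in a few lines of Python. This is a sketch under stated assumptions: the dictionary stands in for the map file, the leading slash is added by us so the result is a site-root path like the examples above, and an unknown key falls back to a blank string.

```python
import re

# In-memory stand-in for the abbreviation map file.
ABBR = {
    "d": "dhtml/",
    "dc": "dhtml/column",
    "pg": "programming/",
    "h": "html/",
    "ht": "html/tools/",
}
RULE = re.compile(r"^/r/([^/]*)/?(.*)")

def expand(path, table=ABBR):
    """Expand an abbreviated /r/ URL; leave any other path untouched."""
    match = RULE.match(path)
    if match is None:
        return path
    key, rest = match.groups()
    # Equivalent of ${abbr:$1}$2, with a blank default for unknown keys.
    return "/" + table.get(key, "") + rest

print(expand("/r/pg/javascript/"))  # /programming/javascript/
print(expand("/r/dc/66/"))          # /dhtml/column66/
print(expand("/r/dc"))              # /dhtml/column
```

Because dc maps to a value with no trailing slash, the remainder 66/ attaches directly to column, which is how column numbers get appended.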

But what happens if you have many keys in your RewriteMap file? Scanning a long text file every time a user clicks a link can slow down lookups. That's where binary hash files come in handy.

Binary Hash RewriteMap

For maximum speed, convert your text RewriteMap file into a binary *DBM hash file. This binary hash version of your key and value pairs is optimized for maximum lookup speed. Convert your text file with a DBM tool or the txt2dbm Perl script provided at http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html.

NOTE: This example is specific to Apache on Unix. Your platform may vary.

Next, change the RewriteMap directive to point to your optimized DBM hash file:

 RewriteMap    abbr    dbm:/www/misc/redir/abbr_webref 
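The idea behind the conversion, loading every key and value into an on-disk hash so a lookup no longer scans a text file, can be illustrated with Python's standard dbm.dumb module. This is only a sketch of the principle: dbm.dumb writes a Python-specific format that we do not claim Apache can read, so for a real server use the txt2dbm script or a DBM tool matching your Apache build. The function names and the demo file name abbr_demo are ours.

```python
import dbm.dumb
import os
import tempfile

PAIRS = [("d", "dhtml/"), ("dc", "dhtml/column"), ("pg", "programming/")]

def build_hash_map(pairs, db_path):
    """Write key/value pairs into a new on-disk hash database."""
    with dbm.dumb.open(db_path, "n") as db:
        for key, value in pairs:
            db[key] = value  # str keys and values are stored as UTF-8 bytes

def lookup(db_path, key, default=""):
    """Fetch one key from the database without reading every entry."""
    with dbm.dumb.open(db_path, "r") as db:
        raw = db.get(key)
        return raw.decode("utf-8") if raw is not None else default

with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "abbr_demo")
    build_hash_map(PAIRS, demo_path)
    print(lookup(demo_path, "pg"))  # programming/
```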

That's the abbreviated version of how you set up link abbreviation on an Apache server. It is a bit of work, but once you've got your site hierarchy fixed, you can do this once and forget it. This technique saves space by allowing abbreviated URLs on the client side and shunting the longer actual URLs to the server. The delay using this technique is hardly noticeable. (If Yahoo! can do it, anyone can.) Done correctly, the rewriting can be transparent to the client. The abbreviated URL is requested, the server expands it, and serves back the content at the expanded location without telling the browser what it has done. You also can use the /r/ flag or the RewriteLog directive to track click-throughs in your server logs.

This technique works well for sites that don't change very often: You would manually abbreviate your URIs to match your RewriteMap abbreviations stored on your server. But what about sites that are updated every day, or every hour, or every minute? Wouldn't it be nice if you could make the entire abbreviation process automatic? That's where the magic of Perl and cron jobs (or the Schedule Tasks GUI in Windows) comes in.

Automatic URL Abbreviation

You can create a Perl or shell script (insert your favorite CGI scripting language here) to look for URLs that match the lookup keys in your map file and automatically abbreviate your URLs. We use this technique on WebReference.com's home page. To make it easy for other developers to auto-abbreviate their URLs, we've created an open source script called shorturls.pl. It is available at http://www.webreference.com/scripts/.

NOTE: XSLT gives you another way to abbreviate URLs automatically. Just create the correct templates to abbreviate all the local links in your files.

The shorturls.pl script allows you to abbreviate URLs automatically and exclude portions of your HTML code from optimization with simple XML tags (<NOABBREV>...</NOABBREV>).
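To show the general shape of automatic abbreviation, here is a short Python sketch. It is not shorturls.pl and makes no claim about how that script works internally; it assumes root-relative href attributes in double quotes, picks the longest matching expansion from the map, and leaves anything inside <NOABBREV> tags (and the tags themselves) untouched. All function names are ours.

```python
import re

ABBR = {"d": "dhtml/", "dc": "dhtml/column", "pg": "programming/", "ht": "html/tools/"}

NOABBREV = re.compile(r"(<NOABBREV>.*?</NOABBREV>)", re.DOTALL | re.IGNORECASE)
HREF = re.compile(r'href="/([^"]*)"')

def abbreviate_path(path, table=ABBR):
    """Abbreviate a root-relative path (given without its leading slash)."""
    best_key, best_value = None, ""
    for key, value in table.items():
        # Prefer the longest expansion that prefixes the path.
        if path.startswith(value) and len(value) > len(best_value):
            best_key, best_value = key, value
    if best_key is None:
        return "/" + path
    return "/r/" + best_key + "/" + path[len(best_value):]

def abbreviate_html(html, table=ABBR):
    """Abbreviate href values everywhere except inside <NOABBREV> blocks."""
    parts = NOABBREV.split(html)
    for i in range(0, len(parts), 2):  # even indexes lie outside <NOABBREV>
        parts[i] = HREF.sub(
            lambda m: 'href="%s"' % abbreviate_path(m.group(1), table), parts[i]
        )
    return "".join(parts)

page = ('<a href="/dhtml/column66/">66</a> '
        '<NOABBREV><a href="/dhtml/">DHTML</a></NOABBREV>')
print(abbreviate_html(page))
# <a href="/r/dc/66/">66</a> <NOABBREV><a href="/dhtml/">DHTML</a></NOABBREV>
```

Choosing the longest matching expansion is a design decision in this sketch: it makes /dhtml/column66/ shrink to /r/dc/66/ rather than the longer /r/d/column66/.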

Using this URL abbreviation technique, we saved over 20 percent (5KB) off our 24KB hand-optimized front page. We could have saved even more space, but for various reasons, we excluded some URLs from abbreviation.
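As a quick check of that arithmetic, and as a way to measure your own pages, the percentage can be computed from the before and after sizes. The sketch below assumes the 24KB figure is the size before abbreviation and 19KB the size after; the function name percent_saved is ours.

```python
def percent_saved(original_bytes, optimized_bytes):
    """Percentage of the original size removed by optimization."""
    return 100.0 * (original_bytes - optimized_bytes) / original_bytes

# 5KB saved from a 24KB page.
print(round(percent_saved(24 * 1024, 19 * 1024), 1))  # 20.8

# Savings for a single link: characters removed per occurrence.
print(len("/programming/javascript/") - len("/r/pg/javascript/"))  # 7
```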

This gives you an idea of the link abbreviation process, but what about all the other areas of WebReference? Here is a truncated version of our abbreviation file to give you an idea of what it looks like (the full version is available at http://www.webreference.com/scripts/):

 b     dlab/
 d     dhtml/
 g     graphics/
 h     html/
 p     perl/
 x     xml/
 3c    3d/lesson
 dd    dhtml/dynomat/
 ddd   dhtml/dynomat/dialogs/
 dc    dhtml/column
 ...
 i     http://www.internet.com/
 ic    http://www.internet.com/corporate/
 ...
 jsc   http://www.javascript.com/
 jss   http://www.javascriptsource.com/
 jsm   http://www.justsmil.com/
 ...

Note that we use two and three-letter abbreviations to represent longer URLs on WebReference.com. Yahoo! uses two-letter abbreviations throughout their home page. How brief you make your abbreviations depends on how many links you need to abbreviate, and how descriptive you want the URLs to be.

The URL Abbreviation/Expansion Process: Step by Step

In order to enable automatic link abbreviation (with shorturls.pl) and expansion (with mod_rewrite), do the following:

  1. Create an abbreviation map file ( RewriteMap ) with short abbreviations that correspond to frequently used and longer directories separated by tabs. For example:

     d    dhtml/
     g    graphics/
     dc   dhtml/column
     gc   graphics/column
     ...
  2. Add the following lines to your httpd.conf file to enable the mod_rewrite engine:

     RewriteEngine    On
     RewriteMap       abbr    txt:/www/misc/redir/abbr_yrdomain.txt
     RewriteRule      ^/r/([^/]*)/?(.*)    ${abbr:$1}$2    [redirect=permanent,last]
  3. Try some abbreviated URLs (type in /r/d, etc.). If they work, move on to step 4; otherwise, check your map and your rewrite directives. If all else fails, contact your system administrator.

  4. Convert your RewriteMap text file to a binary hash file. See http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html for the txt2dbm Perl script.

  5. Change the preceding RewriteMap directive to point to this optimized *DBM hash file:

     RewriteMap       abbr    dbm:/www/misc/redir/abbr_yrdomain 
  6. Now your rewrite engine is set up. To automate URL abbreviation, point shorturls.pl to the text version of your RewriteMap file, input your home page template and output your home page, and schedule the job with cron on UNIX/Linux, or the Schedule Tasks GUI in Windows:

     echo "\nBuilding $YRPAGE from $YRTEMPLATE\n"
     /www/yrdomain/cgi-bin/shorturls.pl $YRTEMPLATE $YRPAGE

That's it. Now any new content that appears on your home page will be automatically abbreviated according to the RewriteMap file that you created, listing the abbreviations you want.
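When step 3 fails, the map file is a common culprit. A small checker like the following Python sketch can flag the two mistakes that are easiest to make by hand: a key with no expansion and a key that appears twice. The function name check_map and the message wording are ours, and the checks are illustrative rather than a complete validation of what mod_rewrite accepts.

```python
def check_map(text):
    """Return a list of problems found in a key/value abbreviation map."""
    problems = []
    seen = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comments are fine
        fields = line.split()
        if len(fields) < 2:
            problems.append(
                "line %d: missing expansion for key %r" % (number, fields[0])
            )
            continue
        key = fields[0]
        if key in seen:
            problems.append(
                "line %d: duplicate key %r (first seen on line %d)"
                % (number, key, seen[key])
            )
        else:
            seen[key] = number
    return problems

sample = "d\tdhtml/\ng\tgraphics/\nd\tdlab/\nx\n"
for problem in check_map(sample):
    print(problem)
# line 3: duplicate key 'd' (first seen on line 1)
# line 4: missing expansion for key 'x'
```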

Use Short URLs

You could name your directories using these short, cryptic abbreviations. Using descriptive names for directories and file names has advantages, however, in usability and search engine positioning. Using URL abbreviation, you can have the best of both worlds for high-traffic pages like home pages.

For front page or frequently referenced objects like single-pixel GIFs, logos, navigation bars, and site-wide rollovers, however, you can use short URLs by placing them high in your site's file structure, and using short filenames. For example:

 /i.gif (internet.com logo)
 /t.gif (transparent single pixel gif)

I've seen some folks carry the descriptive-names-at-all-cost idea to extremes. Here's a surreal-world example:

 transparent-single-pixel-gif1x1.gif (actual file name) 

Some search engine positioning firms sprinkle keywords wherever they are legal, and in some places where they're not. Again, it's a tradeoff. Bulking up your pages with keyword-filled alt values and object names may increase your rankings, but with the advent of backlink-based search engines like Google and Teoma, these practices are fading in effectiveness.

You could even use content negotiation or your srm.conf file to abbreviate file type suffixes. This technique is pretty extreme, seldom used, but perfectly valid. Here's an example:

 i.g (.g = .gif, srm.conf directive of AddType  image/gif    g)
 i (content negotiation resolves to i.gif, could later use i.png)

 



Speed Up Your Site[c] Web Site Optimization
ISBN: 596515081
EAN: N/A
Year: 2005
Pages: 135