Recipe 9.11. Preventing Email Address Harvesting


Problem

You need to protect the email addresses listed on your site so they don't fall prey to spammers.

Solution

Employ one or more of these techniques:

  • Don't list any unprotected email addresses on your site.

  • Disguise addresses you must list on your site, without sacrificing your visitors' ability to click or copy them for legitimate use.

  • Create a script that sends web site messages to your mail server using logic that hides the actual addresses.

  • Block known harvesting agents (or spambots) from accessing your site.

  • Set up a spambot trap.

Discussion

Taken all together, these methods are not guaranteed to stop spammers from getting addresses from your site. The only way to do that is to keep all email addresses, disguised or otherwise, off your site. But that's not practical for most web sites. And let's face it, the day when that becomes the only viable option will be the day the spammers have won.

One of the many ways that spammers get new addresses is with spambots, which crawl the web day and night to scrape web pages for email addresses. Spambots also scour Usenet and online forum postings, domain registrant information in whois databases, and poorly protected web-based mailing lists for new recipients their masters can barrage with junk. If you're not getting spam on at least one of your email accounts, you might want to check your pulse.

If you have unprotected email addresses in your web page code, remove them and consider ditching them altogether. New addresses that you must list on your site should be cloaked using encoding or JavaScript (the See Also section of this Recipe has a link to a well-known online tool for encoding email addresses).

Take the address wdcb@daddison.com. Encoding it with the online tool produces a string of code in which every character has been converted to its HTML entity equivalent:

 &#119;&#100;&#099;&#098;&#064;&#100;&#097;&#100;&#100;&#105;&#115;&#111;&#110;&#046;&#099;&#111;&#109;

The character string can be pasted into your web page code, either in mailto links or as plain text. Browsers will render the coded characters as wdcb@daddison.com, allowing visitors to read, copy, or click the address as expected, while less sophisticated spambots will fail to recognize the string as an email address.
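If you'd rather not depend on the online tool, the conversion is simple enough to do yourself. The following is a minimal sketch, not the tool's actual code: the function name encodeAsEntities is my own invention, and the three-digit zero padding is an assumption chosen to match the entity style used in this Recipe.

```javascript
// Hypothetical helper (not part of any library): turn every character of
// an address into a decimal HTML entity, zero-padded to three digits,
// so that "@" becomes "&#064;".
function encodeAsEntities(address) {
  return Array.from(address)
    .map((ch) => "&#" + String(ch.codePointAt(0)).padStart(3, "0") + ";")
    .join("");
}

// Print the encoded form of the sample address, ready to paste into a page.
console.log(encodeAsEntities("wdcb@daddison.com"));
```

Run it once with Node.js (or in a browser console) and paste the result into your page; the encoding does not need to happen on every request.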

This method has been around a while, so it's reasonable to assume that some spambots have gotten wise to this trick. I continue to use it as a first line of defense, although lately I've begun to mix other stumbling blocks into my concealed addresses. My hope is that combining encoded characters with unencoded characters and commented spaces will keep me one step ahead of the harvesters. For example:

 <a href="mailto:w&#100;&#099;b&#064;&#100;a&#100;&#100;&#105;s&#111;&#110;&#046;&#099;&#111;&#109;">wdcb<!-- -->@<!-- -->daddison.com</a>
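Mixing encoded and unencoded characters by hand gets tedious, so here is one way the idea might be automated. Treat it as a sketch under my own assumptions: the function names are hypothetical, the default choice of which characters to encode is a coin flip, and the empty comments are placed only around the @ sign in the visible text.

```javascript
// Hypothetical helper: encode only the characters that shouldEncode()
// selects, leaving the rest as plain text. By default each character has
// a 50% chance of being encoded.
function partiallyEncode(address, shouldEncode = () => Math.random() < 0.5) {
  return Array.from(address)
    .map((ch, i) =>
      shouldEncode(ch, i)
        ? "&#" + String(ch.codePointAt(0)).padStart(3, "0") + ";"
        : ch
    )
    .join("");
}

// Hypothetical helper: surround the first "@" with empty HTML comments,
// which browsers ignore when rendering the visible address.
function commentAroundAt(address) {
  return address.replace("@", "<!-- -->@<!-- -->");
}

// Build the whole link: a partially encoded href plus commented link text.
function disguisedLink(address, shouldEncode) {
  return (
    '<a href="mailto:' + partiallyEncode(address, shouldEncode) + '">' +
    commentAroundAt(address) + "</a>"
  );
}

console.log(disguisedLink("wdcb@daddison.com"));
```

Because the default pattern is random, each run produces a different mix. Pass your own shouldEncode function if you want a repeatable result.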

You also can use JavaScript's document.write( ) method to disguise email addresses. Break up the linked address into random segments, and then reassemble them on the page, like this:

 <script language=javascript>
   h='daddison.com';
   n='wdcb';
   document.write('<a href="mai');
   document.write('lto:'+n+'@');
   document.write(h+'">');
   document.write(n+'@');
   document.write(h);
   document.write('</a>');
 </script>
 <noscript>wdcb(at)daddison.com</noscript>

Here the variables h (for host) and n (for name) are sprinkled into the code sections that together will create a standard email link. Because the browser runs the JavaScript, a spambot reading the page source sees the script itself rather than its output. The rendered version that visitors see displays a linked address that's clickable and copyable. As with all client-side scripting solutions, the visitor must have JavaScript enabled for this method to work. The <noscript> section of the code provides a less functional alternative for visitors who have it disabled.
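The same split-and-reassemble idea can be wrapped in a reusable function. This is only a sketch: assembleMailLink is a name I made up, and the element id "contact" is an assumed placeholder in your page. It builds the link as a string and, when run in a browser, drops it into that placeholder instead of calling document.write( ).

```javascript
// Hypothetical helper: assemble a mailto link from separate name and host
// pieces so the complete address never appears in the page source.
function assembleMailLink(name, host) {
  const address = name + "@" + host;
  const scheme = "mai" + "lto:"; // split so "mailto:" is not a literal in the source
  return '<a href="' + scheme + address + '">' + address + "</a>";
}

// In a browser, place the link into an element with the (assumed) id
// "contact". Outside a browser this part is skipped.
if (typeof document !== "undefined") {
  const slot = document.getElementById("contact");
  if (slot) {
    slot.innerHTML = assembleMailLink("wdcb", "daddison.com");
  }
}
```

You would still want a <noscript> fallback next to the placeholder for visitors without JavaScript.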

A contact form is a better way to allow your site visitors to communicate with you via email. The widely used Perl script formmail.pl requires the recipient's address to be passed as an argument from the form. So, if you're using this basic method (or a similar one), the recipient field values in your form code must be disguised just like linked addresses. A less vulnerable script (which you can write yourself or find online) conceals the actual email address with logic. For example, you could define an array variable of names:

 @names = ("doug","amy","bob","jane","eleanor","phoebe");

Then send the element number to the script in a form field. The script looks up the matching name, concatenates it with the @hostname, and sends the full recipient address and message to the mail server:

 <select name="recipient">
   <option label="Doug" value="0">Doug</option>
   <option label="Amy" value="1">Amy</option>
   <option label="Bob" value="2">Bob</option>
   <option label="Jane" value="3">Jane</option>
   <option label="Eleanor" value="4">Eleanor</option>
   <option label="Phoebe" value="5">Phoebe</option>
 </select>
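The lookup on the script side might work along these lines. I've sketched it in JavaScript rather than Perl, and it is an illustration only: resolveRecipient is a hypothetical function name, example.com stands in for your real hostname, and actually handing the message to the mail server is left out.

```javascript
// The same list of names as in the array above. The addresses themselves
// never appear in the form code.
const NAMES = ["doug", "amy", "bob", "jane", "eleanor", "phoebe"];
const HOSTNAME = "example.com"; // assumption: replace with your own mail domain

// Hypothetical helper: turn the submitted "recipient" field value into a
// full address, or return null if the value is not a valid element number.
function resolveRecipient(fieldValue) {
  const text = String(fieldValue);
  if (!/^\d+$/.test(text)) {
    return null; // reject anything that is not a plain non-negative integer
  }
  const index = Number(text);
  if (index >= NAMES.length) {
    return null; // element number is past the end of the list
  }
  return NAMES[index] + "@" + HOSTNAME;
}

console.log(resolveRecipient("1"));
```

Rejecting anything that isn't a valid element number matters here: the script should only ever send mail to addresses on your own list, no matter what a visitor (or a bot) submits in the form.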

After you've done your best to protect the email addresses on your site, it's time to take the fight to the spammers. First, set up a rewrite rule in your Apache configuration or an .htaccess file to use Apache's mod_rewrite module (if it's installed for your web server) to redirect known spambots, based on their self-identified user agent name, to their own special page on your site:

 RewriteEngine on
 RewriteCond $1 !^spam/index.html
 RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
 RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
 …conditions for other spambot agents…
 RewriteRule (.*) /spam/index.html [L,R]

The first line of code enables the rewrite engine. The first condition checks that the request is not already for the spambot page, which prevents the rewrite rule from looping endlessly. Each subsequent condition then identifies a spambot agent to which the rule will apply. If the client software accessing your site identifies itself as one of many known spambot agents, the server will redirect the request to /spam/index.html. (Check the See Also section of this Recipe for a link to a long-lived online discussion about using this method, as well as the names of many known spambot agents.)
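To make the decision behind these rules concrete, here is the same logic as a small JavaScript function. It is a sketch for illustration only, since Apache does this work for you: redirectTarget is a hypothetical name, and the list holds just the two agents named in the rule above.

```javascript
// The two agent names from the rewrite conditions. A real list would be longer.
const SPAMBOT_PREFIXES = ["EmailSiphon", "EmailWolf"];
const TRAP_PATH = "/spam/index.html";

// Hypothetical helper: return the path to redirect to, or null when the
// request should be served normally.
function redirectTarget(requestPath, userAgent) {
  // Mirrors the first condition: never redirect a request that is already
  // for the trap page, or the redirect would loop.
  if (requestPath === TRAP_PATH) {
    return null;
  }
  // Mirrors the "^Name" patterns: the user agent string must start with a
  // known spambot name (the match is case-sensitive).
  const isSpambot = SPAMBOT_PREFIXES.some((prefix) =>
    (userAgent || "").startsWith(prefix)
  );
  return isSpambot ? TRAP_PATH : null;
}

console.log(redirectTarget("/contact/", "EmailSiphon"));
```

Note that the match depends entirely on what the client says about itself, so the check can only ever catch agents that announce their real name.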

Two notes of caution about using this method: first, believe it or not, many spambot agents do not identify themselves honestly. They might instead claim to be one of the many more common browser agents (Internet Explorer, Netscape, or Mozilla), and you can't redirect those requests without also turning away legitimate visitors to your site. Also, giving Apache a long list of user agents to check before responding to a request can significantly hinder performance if the server must do so for every directory on your site. For that reason, I recommend that you isolate this method for a specific file or in a specific directory where you must display email addresses, such as http://yoursite.com/contact/. In other words, let spambots roam the address-free pages of your site, but deflect them from the pages they really want to find.

On my special spambot page, I set a trap using the well-known Perl script wpoison.pl, which I embedded in the file /spam/index.html using the server-side include tag <!--#include virtual="/cgi-bin/wpoison.pl"-->. The Wpoison script (the download link is in the See Also section of this Recipe) generates a page of bogus email addresses and links to itself (see Figure 9-7). Confined to a self-referencing page, spambots will devour the garbage data, making it harder for their masters to dump their junk on you.
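If you're curious what a trap page of this kind involves, the sketch below generates a handful of nonsense addresses plus a link back to the same page. To be clear, this is not how Wpoison itself is written. It is my own minimal illustration, with hypothetical function names, a simple seeded random number generator so results are repeatable, and the reserved .example domain so that no real mail server ends up on a spammer's list.

```javascript
// Small seeded pseudo-random generator (a linear congruential generator),
// so the same seed always produces the same page.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in the range [0, 1)
  };
}

// Build a nonsense word of lowercase letters.
function randomWord(random, length) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(random() * letters.length)];
  }
  return word;
}

// Hypothetical helper: return HTML containing `count` bogus mailto links
// followed by one link that leads back to the trap page itself.
function bogusPage(seed, count, selfPath) {
  const random = makeRandom(seed);
  const lines = [];
  for (let i = 0; i < count; i++) {
    const address =
      randomWord(random, 6) + "@" + randomWord(random, 8) + ".example";
    lines.push('<a href="mailto:' + address + '">' + address + "</a>");
  }
  lines.push(
    '<a href="' + selfPath + "?page=" + randomWord(random, 5) + '">more</a>'
  );
  return lines.join("\n");
}

console.log(bogusPage(1, 3, "/spam/index.html"));
```

The self-referencing link is what keeps a spambot busy: every page it fetches offers another page of the same garbage.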

Figure 9-7. The spambot trap Wpoison generates pages of nonsense addresses and links


See Also

Encode your email addresses with this online tool: http://www.wbwip.com/wbw/emailencoder.html. For more information on redirecting spambots and setting a trap, check out this online discussion http://www.webmasterworld.com/forum13/687-1-10.htm and the Wpoison developer's site at http://www.monkeys.com/wpoison.



Web Site Cookbook: Solutions & Examples for Building and Administering Your Web Site (Cookbooks (OReilly))
ISBN: 0596101090
Pages: 144
Authors: Doug Addison
