Hack 48. Remove Spammy Domains from Search Results


Fight back against search engine spammers who register domains with multiple "hot" keywords separated by hyphens.

Google and other search engines are engaged in an ongoing arms race against spammers, who use every conceivable trick to attain top placement for lucrative search keywords. One such trick is to register a domain name with the keywords themselves, such as buy-cheap-prescription-drugs-online.com. (I just made that up, although I wouldn't be the slightest bit surprised if it already existed. In fact, I would be surprised if it didn't.) Recently, Google has cracked down on such techniques, but some spammy domains still show up in search results.

Think of the web sites you visit on a regular basis. I'll bet that none of them contains more than one hyphen. In fact, the only time I ever see multi-hyphen domain names is when a spammer is one step ahead of Google and manages to get his site listed in the results. (I don't buy cheap prescription drugs online, but I did need to refinance my home last year. Search engine results were so overwhelmed with spam, I almost broke down and used a phone book.)

6.3.1. The Code

This user script removes Google search results where the domain contains more than one hyphen. Once again, the bulk of the logic is contained in the XPath query. This is tricky for two reasons. First, we need to count the number of instances of a particular character in a string, and XPath doesn't have a native function to do that. Second, we need to isolate the entire search result (link, description, everything) and remove it all at once.

We can solve the first problem (counting the hyphens in the domain) by a clever use of the XPath translate function, which "translates" a string by replacing specific characters with other characters. The key here is to tell the translate function to replace a character with nothing (in other words, to remove it altogether). If we strip the letters, digits, dots, and colons from the URL, and the result starts with the string "//--", the original URL must have contained at least two hyphens in its domain. (Many legitimate web publishing systems generate URLs with multiple hyphens in the pathname, so we must be careful not to match URLs such as http://diveintomark.org/archives/2004/08/13/safari-content-sniffing.)

A complete list of XPath functions is available at http://www.w3schools.com/xpath/xpath_functions.asp.
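To see why the translate trick described above works, it can help to replay it outside the browser. The following is a minimal sketch in plain JavaScript, not part of the original script: xpathTranslate and looksLikeHyphenSpam are hypothetical helper names, and xpathTranslate only approximates the XPath 1.0 translate function for ordinary strings.

```javascript
// Hypothetical helper that mimics XPath 1.0 translate(): each character of
// `str` that appears in `from` is replaced by the character at the same
// position in `to`, or dropped when `to` has no character at that position.
function xpathTranslate(str, from, to) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    var idx = from.indexOf(ch); // first occurrence wins, as in XPath
    if (idx === -1) {
      out += ch;                // not listed in `from`: keep as-is
    } else if (idx < to.length) {
      out += to.charAt(idx);    // listed, with a replacement
    }                           // listed, no replacement: drop it
  }
  return out;
}

// Hypothetical helper that applies the same two translate() calls and the
// same starts-with test as the XPath predicate in the user script.
function looksLikeHyphenSpam(href) {
  var stripped = xpathTranslate(
    xpathTranslate(href, 'http:', ''),
    '.:abcdefghijklmnopqrstuvwxyz0123456789', '');
  return stripped.indexOf('//--') === 0;
}
```

For a URL with hyphens only in its path, the stripped string begins with a run of slashes before the first hyphen, so the "//--" prefix test fails. For a multi-hyphen domain, the two slashes after the scheme are followed immediately by hyphens.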


We can solve the second problem by using the ancestor:: axis. Each search result is wrapped in a <p class="g"> element. (I have no idea what g stands for. Google likes single-character names; it probably reduces their bandwidth costs.) Once we find a link that contains two hyphens, we can use "/ancestor::p[@class='g']" to get the surrounding paragraph, and then remove the entire search result in one shot.
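For readers more comfortable with DOM traversal than with XPath axes, the ancestor:: step corresponds roughly to climbing the parentNode chain. Here is a small sketch under that assumption; findResultContainer is a hypothetical name, and unlike the XPath axis (which selects every matching ancestor) it returns only the nearest one.

```javascript
// Hypothetical helper: climb from a node to its nearest <p class="g">
// ancestor, or return null if there is none. It relies only on the
// parentNode, nodeName, and className properties of each node.
function findResultContainer(node) {
  var current = node.parentNode;
  while (current) {
    if (current.nodeName &&
        current.nodeName.toLowerCase() === 'p' &&
        current.className === 'g') {
      return current;
    }
    current = current.parentNode;
  }
  return null;
}
```

In a user script, the argument would be a link element taken from the search results page. The exact comparison against 'g' mirrors the @class='g' test, so an element with additional class names would not match.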

Save the following user script as hyphenspam.user.js:

    // ==UserScript==
    // @name          Hyphen Spam Remover
    // @namespace     http://diveintomark.org/projects/greasemonkey/
    // @description   remove search results with 2 or more hyphens in domain
    // @include       http://www.google.com/search*
    // ==/UserScript==

    // Select the result paragraph around every link whose domain
    // contains two or more hyphens.
    var snapFilter = document.evaluate(
        "//a[starts-with(translate(translate(@href, 'http:', ''), " +
        "'.:abcdefghijklmnopqrstuvwxyz0123456789', ''), '//--')]" +
        "/ancestor::p[@class='g']",
        document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);

    // Walk the snapshot backward and remove each matching result.
    for (var i = snapFilter.snapshotLength - 1; i >= 0; i--) {
        var elmFilter = snapFilter.snapshotItem(i);
        elmFilter.parentNode.removeChild(elmFilter);
    }
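The hyphen test itself does not have to live in XPath. As an alternative sketch (not the approach the script above takes), the host can be pulled out of each URL with a regular expression and its hyphens counted directly. The names extractHost, countHostHyphens, and isHyphenSpamHost are hypothetical, and the pattern assumes a simple scheme://host/path URL; any port or user information would be counted as part of the host.

```javascript
// Hypothetical helper: return everything between "scheme://" and the
// first "/", "?", or "#", or an empty string if the URL has no such part.
function extractHost(href) {
  var match = /^[a-z][a-z0-9+.\-]*:\/\/([^\/?#]*)/i.exec(href);
  return match ? match[1] : '';
}

// Count the hyphens in the host portion only, ignoring the path.
function countHostHyphens(href) {
  return extractHost(href).split('-').length - 1;
}

// Same threshold as the user script: two or more hyphens in the domain.
function isHyphenSpamHost(href) {
  return countHostHyphens(href) >= 2;
}
```

A script built this way would loop over the links on the page and call isHyphenSpamHost on each href, which trades the single XPath query for a more readable test.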

6.3.2. Running the Hack

Go to http://www.google.com and search for buy cheap lortab site:.ru. (The site:.ru finds sites hosted in Russia. I have nothing against Russia per se, except that when I wrote this, Google seemed to have already cracked down on most spammy domains in .com and .net, but I found several examples of such domains in .ru.) When you read this, your results will undoubtedly differ, but Figure 6-3 shows what I saw.

Figure 6-3. Google search with spammy results


Now, install the script (Tools → Install This User Script), and refresh the Google search results page. My results were the same, except that the script removed the top search result, as shown in Figure 6-4.

Figure 6-4. Google search, now with 10% less spam


6.3.3. Hacking the Hack

The possibilities here are infinite. Don't want to ever see search results on microsoft.com? You could alter your search habits to include -site:microsoft.com in every search. Or you could let Greasemonkey do it for you:

    // Select the result paragraph around every link to microsoft.com.
    var snapFilter = document.evaluate(
        "//a[contains(@href, 'microsoft.com')]" +
        "/ancestor::p[@class='g']",
        document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
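To filter several domains at once, the XPath predicate can be generated from a list rather than written by hand. This is a sketch only: buildDomainFilterXPath is a hypothetical helper, and example.com in the usage below is a placeholder, not a domain from the original hack.

```javascript
// Hypothetical helper: build one XPath expression that selects the result
// paragraph around any link whose href contains one of the given domains.
function buildDomainFilterXPath(domains) {
  if (domains.length === 0) {
    throw new Error('at least one domain is required');
  }
  var tests = [];
  for (var i = 0; i < domains.length; i++) {
    // XPath 1.0 string literals have no escape syntax, so refuse quotes.
    if (domains[i].indexOf("'") !== -1) {
      throw new Error('domain must not contain a single quote: ' + domains[i]);
    }
    tests.push("contains(@href, '" + domains[i] + "')");
  }
  return "//a[" + tests.join(' or ') + "]/ancestor::p[@class='g']";
}
```

The returned string would be passed as the first argument to document.evaluate, for example buildDomainFilterXPath(['microsoft.com', 'example.com']). Note that contains matches anywhere in the URL, so a link that merely mentions the domain in its path or query string would be filtered as well.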

When I'm searching for a specific answer to a technical question, I often find the answer in a mailing list that has been publicly archived and indexed. Here's a variation of this hack that highlights search results that are likely to be part of an archived mailing list, as shown in Figure 6-5.

    // Select the result paragraph around every link that looks like
    // a mailing list archive.
    var snapFilter = document.evaluate(
        "//a[contains(@href, 'pipermail') or " +
        "starts-with(@href, 'http://mail') or " +
        "starts-with(@href, 'http://list')]" +
        "/ancestor::p[@class='g']",
        document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);

    // Highlight each matching result instead of removing it.
    for (var i = snapFilter.snapshotLength - 1; i >= 0; i--) {
        var elmFilter = snapFilter.snapshotItem(i);
        elmFilter.style.backgroundColor = 'silver';
    }

Figure 6-5. Search results with highlighted mailing list archives
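When tuning the mailing list heuristic, it may be easier to try candidate URLs against a plain JavaScript version of the same three tests before editing the XPath. The sketch below assumes nothing beyond the predicate shown above; isLikelyListArchive is a hypothetical name.

```javascript
// Hypothetical helper mirroring the XPath predicate: true when the URL
// contains "pipermail" or starts with "http://mail" or "http://list".
function isLikelyListArchive(href) {
  return href.indexOf('pipermail') !== -1 ||
         href.indexOf('http://mail') === 0 ||
         href.indexOf('http://list') === 0;
}
```

Like the XPath version, this is a rough guess based on naming habits, so it will miss archives hosted under other names and may highlight unrelated hosts whose names happen to begin with "mail" or "list".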




    Greasemonkey Hacks: Tips & Tools for Remixing the Web with Firefox
    ISBN: 0596101651
    Year: 2005
    Pages: 168
    Authors: Mark Pilgrim
