Hack 53. Browse the Web Through Google's Cache

Change links in cached pages to point to the cached version.

One of the nicest (and most controversial) features of Google's web search is its ability to show you a cached version of the page. This is useful if the original server is temporarily down or is just horrendously slow. It is also useful to see if the web publisher is playing tricks on Google to try to increase their search ranking, since the cache will show you the page that the site returned when Google's bots came a-crawling. The only downside of the Google cache is that links in the cached page point to the original site, which might still be unavailable (the very reason you turned to the cached version in the first place).

This hack modifies the cached pages that Google displays: next to each link in a cached page, it adds a second link that points to Google's cache of the linked page.

6.8.1. The Code

This user script runs on Google cache pages. Google uses a variety of raw IP addresses to display cached pages, so we match on any IP address or domain name and simply look at the structure of the URL path and query parameters to determine whether we're looking at a cached page. If this causes false positives for you, you can exclude specific domains with an @exclude parameter.
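To see why a cache URL served from a raw IP address still triggers the script, here is a minimal sketch of wildcard matching in the spirit of the @include line. The helper name globToRegExp and the sample URLs are assumptions made for illustration, and Greasemonkey's own matcher may differ in detail.

```javascript
// Hypothetical helper: turn an @include-style pattern into a RegExp.
// Assumption: "*" matches any run of characters; everything else is literal.
function globToRegExp(pattern) {
  var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

var include = globToRegExp('http://*/search?*q=cache:*');

// Illustrative URLs, not taken from a real session.
var samples = [
  'http://192.0.2.10/search?q=cache:example.com/page+terms',  // raw IP host
  'http://www.google.com/search?hl=en&q=cache:example.com/',  // domain host
  'http://www.google.com/search?q=greasemonkey'               // ordinary search
];

var results = samples.map(function (url) { return include.test(url); });
```

Only the host part is left open by the pattern, so any site whose URLs happen to look like /search?...q=cache:... would match as well; that is the false-positive case an @exclude line is meant to handle.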

There is one important thing to note in this code. Normally, I would use the document.links collection to get a list of all the links on the page. However, document.links is a dynamic collection. If you add a link to the page while iterating through the collection, you could end up in an infinite loop. Therefore, I use the document.evaluate function to return a static snapshot of all the links on the page. See "Master XPath Expressions" [Hack #8] for more information about static snapshots.
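The hazard is easy to reproduce without a browser. The sketch below models a live collection as a plain array that grows while it is being iterated, and compares it with iterating over a copy taken up front. The function names and the iteration cap are assumptions made for illustration; in the real script the live list would be document.links and the copy is the XPath snapshot.

```javascript
// Model of a "live" collection: the array being iterated is the one we
// insert into, so every insertion is immediately visible to the loop.
function annotateLive(links, maxSteps) {
  var steps = 0;
  for (var i = 0; i < links.length; i++) {
    if (++steps > maxSteps) {
      return 'gave up after ' + maxSteps + ' steps'; // would otherwise never end
    }
    links.splice(i + 1, 0, 'cache'); // add a new link right after the current one
  }
  return 'finished';
}

// Snapshot approach: iterate a frozen copy, insert into the real list.
// Assumes the original items are distinct so indexOf finds the right one.
function annotateSnapshot(links) {
  var snapshot = links.slice(); // static copy, playing the role of the XPath snapshot
  for (var i = 0; i < snapshot.length; i++) {
    var pos = links.indexOf(snapshot[i]);
    links.splice(pos + 1, 0, 'cache');
  }
  return links;
}

var liveOutcome = annotateLive(['a', 'b'], 1000);
var snapOutcome = annotateSnapshot(['a', 'b']);
```

In the live version each inserted item becomes the next item visited, so the loop index never catches up with the length. The snapshot version visits exactly the items that existed when the copy was taken.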

Save the following user script as google.cache.user.js:

    // ==UserScript==
    // @name          Google Cache Continue
    // @namespace     http://babylon.idlevice.co.uk/javascript/greasemonkey/
    // @description   Convert Google cache links to also use Google cache
    // @include       http://*/search?*q=cache:*
    // ==/UserScript==

    // based on code by Jonathon Ramsey
    // and included here with his gracious permission

    /* Modify these vars to change the appearance of the cache links */
    var cacheLinkText = 'cache';
    var cacheLinkStyle = "\
    a.googleCache {\
      font:normal bold x-small sans-serif;\
      color:red;\
      background-color:yellow;\
      padding:0 0.6ex 0.4ex 0.3ex;\
      margin:0.3ex;\
    }\
    a.googleCache:hover {\
      color:yellow;\
      background-color:red;\
    }\
    p#googleCacheExplanation {\
      border:1px solid green;\
      padding:1ex 0.5ex;\
      font-family:sans-serif;\
    }";

    addStyles(cacheLinkStyle);

    // Split the cache URL into the page part and the "+search+terms" part.
    // This is done before the no-cache check so that urlPage is defined
    // when addUncachedLink needs it.
    var arParts = window.location.href.match(/http:\/\/[^\/]*\/([^\+]*)(\+[^&]*)/);
    var urlPage = arParts[1];
    var sTerms = arParts[2];

    if (googleHasNoCache()) {
        addUncachedLink(urlPage);
        return; // top-level return relies on Greasemonkey wrapping the script in a function
    }

    var bAlter = false;
    // Static snapshot of all links, safe to iterate while adding new links.
    var snapLinks = document.evaluate('//a[@href]', document, null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < snapLinks.snapshotLength; i++) {
        var elmLink = snapLinks.snapshotItem(i);
        // Only links after Google's own header get a cache link.
        if (bAlter && linkIsHttp(elmLink)) {
            addCacheLink(elmLink, sTerms, cacheLinkText);
        }
        if (isLastGoogleLink(elmLink)) {
            bAlter = true;
            addExplanation(elmLink, cacheLinkText);
        }
    }

    function addStyles(cacheLinkStyle) {
        var style = document.createElement('style');
        style.type = 'text/css';
        style.innerHTML = cacheLinkStyle;
        document.body.appendChild(style);
    }

    function googleHasNoCache() {
        return 0 == document.title.indexOf('Google Search: cache:');
    }

    function addUncachedLink(url) {
        var urlUncached = url.split('cache:')[1];
        var elmP = document.createElement('p');
        elmP.id = 'googleCacheExplanation';
        elmP.innerHTML = "<b>Uncached:</b> <a href='http://" +
            urlUncached + "'>" + urlUncached + '</a>';
        var suggestions = document.getElementsByTagName('blockquote')[0];
        document.body.replaceChild(elmP,
            suggestions.previousSibling.previousSibling);
    }

    function linkIsHttp(link) {
        return 0 == link.href.search(/^http/);
    }

    function isLastGoogleLink(elmLink) {
        return (-1 < elmLink.text.indexOf('cached text'));
    }

    function addExplanation(link, cacheLinkText) {
        var p = document.createElement('p');
        p.id = 'googleCacheExplanation';
        p.innerHTML = "Use <a href='" + document.location.href +
            "' class='googleCache'>" + cacheLinkText +
            "</a> links to continue using the Google cache.";
        var tableCell = link.parentNode.parentNode.parentNode.parentNode;
        tableCell.appendChild(p);
    }

    function addCacheLink(elmLink, sTerms, cacheLinkText) {
        var cacheLink = document.createElement('a');
        cacheLink.href = getCacheLinkHref(elmLink, sTerms);
        cacheLink.appendChild(document.createTextNode(cacheLinkText));
        cacheLink.className = 'googleCache';
        elmLink.parentNode.insertBefore(cacheLink, elmLink.nextSibling);
    }

    function getCacheLinkHref(elmLink, sTerms) {
        var href = elmLink.href.replace(/^http:\/\//, '');
        var fragment = '';
        if (hrefLinksToFragment(href)) {
            // Move the #fragment to the end of the cache URL.
            var arParts = href.match(/([^#]*)#(.*)/);
            href = arParts[1];
            fragment = '#' + arParts[2];
        }
        return 'http://www.google.com/search?q=cache:' + href + sTerms + fragment;
    }

    function hrefLinksToFragment(href) {
        return (-1 < href.indexOf('#'));
    }

6.8.2. Running the Hack

After installing the user script (Tools → Install This User Script), go to http://www.google.com and search for "xml on the web" (including the quotes). At the time of this writing, the first result is for my article on O'Reilly's XML.com, titled "XML on the Web Has Failed," at http://www.xml.com/pub/a/2004/07/21/dive.html. Click the Cached link next to the first search result to see Google's cache of this article, as shown in Figure 6-12.

Figure 6-12. Cached copy of "XML on the Web Has Failed"


Each link in the article has been augmented with a "cache" link. Click the "cache" link next to the "Dive into XML" image, and it will take you to the cached copy of all the XML.com articles I've written, as shown in Figure 6-13.
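As a concrete illustration of where each added link points, the following sketch restates the script's getCacheLinkHref logic as a standalone function that takes strings instead of DOM nodes. The name buildCacheHref, the example.com address, and the sample search terms are assumptions; in the script, the terms suffix is whatever followed the first + in the current cache URL.

```javascript
// Standalone restatement of the href-building step (no DOM required).
// href:   absolute URL of a link found in the cached page
// sTerms: "+..." suffix carried over from the current cache URL
function buildCacheHref(href, sTerms) {
  var target = href.replace(/^http:\/\//, ''); // the script strips the scheme first
  var fragment = '';
  var hashAt = target.indexOf('#');
  if (hashAt !== -1) {
    // Keep the fragment last so it stays a fragment of the cache URL.
    fragment = target.slice(hashAt);
    target = target.slice(0, hashAt);
  }
  return 'http://www.google.com/search?q=cache:' + target + sTerms + fragment;
}

var cacheHref = buildCacheHref(
  'http://example.com/articles/index.html#latest', '+xml+web');
// cacheHref is
// 'http://www.google.com/search?q=cache:example.com/articles/index.html+xml+web#latest'
```

Carrying the search terms along means the next cached page is requested with the same terms as the one you started from.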

Google does not keep cached copies of every page on the Internet. If a page is moved or deleted, it will eventually disappear from Google's cache. Or the publisher might use a <meta> element to tell Google not to cache a specific page. If you try to follow a link to a page that is not in Google's cache, Google will display an empty search results page informing you that the cached page could not be found, and the script will insert a link to the original page.
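The fallback link is derived from the address of the failed cache request itself. Here is a minimal sketch of that extraction, applying the script's URL pattern to a made-up address. The function name originalUrlFromCache is an assumption, as is the null return for addresses that do not have the expected shape.

```javascript
// Recover the original page address from a cache URL.
// Returns null when the URL does not look like a cache request with terms.
function originalUrlFromCache(locationHref) {
  var parts = locationHref.match(/http:\/\/[^\/]*\/([^\+]*)(\+[^&]*)/);
  if (!parts) {
    return null;
  }
  // parts[1] is everything between the host and the first "+",
  // for example "search?q=cache:example.com/gone.html".
  var page = parts[1].split('cache:')[1];
  return page ? 'http://' + page : null;
}

var fallback = originalUrlFromCache(
  'http://www.google.com/search?q=cache:example.com/gone.html+xml+web');
var unmatched = originalUrlFromCache('http://www.google.com/search?q=plain');
```

Here fallback is 'http://example.com/gone.html', which is the address the script would offer as the "Uncached" link, and unmatched is null because the second address contains no + for the pattern to anchor on.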

Figure 6-13. Cached copy of "Dive into XML" articles


    Greasemonkey Hacks: Tips & Tools for Remixing the Web with Firefox
    ISBN: 0596101651
    Year: 2005
    Pages: 168
    Authors: Mark Pilgrim
