Section 5.1. Capturing Web Pages


First, consider individual web pages: the HTML source of a single page can reveal a surprising amount about its creator, and the links contained therein help you map out the structure of the entire site. All web browsers allow you to view the source for a page and to save that to a file on your local computer. While these fundamental operations may seem trivial, there are a couple of important issues of which you need to be aware.

The first is that many of today's web pages include other files, without which they cannot be properly displayed. Images are the most obvious example, but stylesheets and JavaScript files have become increasingly common. In most cases, the links to those files are relative, not absolute, so a browser that opens the saved page resolves them against the location of the local copy and fails to find the files. Either the links have to be updated in the downloaded web page or the supporting files must also be saved.
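To make the relative-link issue concrete, here is a minimal Python sketch that lists the supporting files of a page as absolute URLs. It is an illustration only: the class and function names are my own, the example.com addresses in the usage note are placeholders, and it looks only at <img>, <script>, and <link> tags, ignoring any <base> tag and any URLs built by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLinkParser(HTMLParser):
    """Collect the URLs of supporting files referenced by a page.

    Hypothetical helper for illustration; not part of any library.
    """

    # Tag name -> attribute that holds the URL of the supporting file.
    # Note that <link> also covers non-stylesheet links such as icons.
    URL_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # urljoin resolves relative links against the page URL
                # and leaves links that are already absolute unchanged.
                self.resources.append(urljoin(self.base_url, value))


def absolute_resource_urls(html, base_url):
    """Return the supporting-file URLs of `html`, in document order."""
    parser = ResourceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

Called with the page source and the URL it was fetched from, for example absolute_resource_urls(source, "http://www.example.com/about/index.html"), the function returns a list that could be handed to a download tool so that the supporting files are saved alongside the page.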

The second problem is that most web pages do not include the URL from which they were downloaded. That means that you have to save that URL string in a separate file or insert it as a comment in the saved web page. Doing either of these manually is an inconvenience.

Some browser developers have addressed these problems. Mozilla Firefox will save any associated files when a web page is saved as Web Page, complete, as shown in Figure 5-1.

Figure 5-1. Mozilla Firefox Save As dialog box


Those files are saved to a directory that is created in the same location as the saved web page. So, for example, if I save index.html to a directory, then I will find a subdirectory called index_files that contains any images, stylesheets, and so forth that were referenced by the original file. Furthermore, most links to those files will have been updated to point to the saved copies. I use the term "most" because Firefox is not able to update links that are included as parameters to JavaScript functions, such as image rollover functions. With those exceptions, the saved page and its ancillary files can be opened from a browser on that machine and the page should look the same as the original.

Although this is convenient, it does mean that the saved web page is no longer identical to the original. In fact, Firefox makes a number of changes to the HTML it saves. I presume that these are intended to ensure that saved pages contain valid HTML, but the effect is that comparing saved pages with originals becomes very difficult. Consider these few lines of HTML from my home page:

     <table width="90%" border="0" align="center" cellpadding="0"
     cellspacing="0">
       <tr>

Firefox rearranges the attributes in the <table> tag so that they lie in alphabetical order. It also adds a new <tbody> ahead of the first <tr> tag:

     <table align="center" border="0" cellpadding="0" cellspacing="0"
     width="90%">
       <tbody><tr>

This type of unseen modification of files can be the source of much confusion when you want to compare files. To avoid it, you can either download files individually in Firefox, saving them as Web Page, HTML Only, or use the non-interactive download tool, wget.
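If you already have a copy that was saved as Web Page, complete, one way to check whether it differs from the original only in the ways described above is to compare the tags after sorting their attributes. The following Python sketch does that. The names are my own, and the decision to skip <tbody> start tags is an assumption based solely on the example above; other changes that a browser might make are not handled, and text content is not compared at all.

```python
from html.parser import HTMLParser


class TagNormalizer(HTMLParser):
    """Reduce HTML to a list of start tags with attributes sorted by name.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, ignore_tags=()):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            return
        # Sort on the attribute name only; values may be None for
        # attributes written without a value.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.tags.append((tag, ordered))


def normalized_tags(html, ignore_tags=("tbody",)):
    """Return the start tags of `html` in a form that ignores attribute order."""
    parser = TagNormalizer(ignore_tags)
    parser.feed(html)
    parser.close()
    return parser.tags


def same_markup(original_html, saved_html):
    """True if both documents have the same start tags after normalization."""
    return normalized_tags(original_html) == normalized_tags(saved_html)
```

Applied to the two <table> fragments shown above, same_markup reports them as equivalent even though the raw strings differ. A result of False only tells you that something else changed; you would still need a conventional diff to find out what.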

Internet Explorer can also save all the files associated with a page, and it solves the second problem of associating the saved web page with the original URL. It inserts a comment line at the top of the page, before the <html> tag, which records the original URL. This example shows the comment from a downloaded copy of my home page:

     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
     <!-- saved from url=(0021)http://www.craic.com/ -->
     <HTML lang=en>

The number in parentheses right before the URL represents the number of characters in that URL string.
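The format is simple enough to generate and check yourself. This Python sketch builds a comment in the same style and tests whether the number in parentheses matches the length of the URL that follows it. The function names are my own, and the four-digit zero padding is inferred from the examples in this section rather than from any published specification. The pattern also assumes that the comment sits on a single line with no whitespace inside the URL.

```python
import re

# Matches a comment of the form: <!-- saved from url=(NNNN)URL -->
SAVED_FROM = re.compile(r"<!--\s*saved from url=\((\d+)\)(\S+)\s*-->")


def make_saved_from_comment(url):
    """Build a 'saved from' comment for `url` (padding width is an assumption)."""
    return "<!-- saved from url=({:04d}){} -->".format(len(url), url)


def parse_saved_from_comment(text):
    """Return (url, count_matches) for the first such comment, or None.

    count_matches is True when the declared number equals the number of
    characters in the URL.
    """
    match = SAVED_FROM.search(text)
    if match is None:
        return None
    declared = int(match.group(1))
    url = match.group(2)
    return url, declared == len(url)
```

A mismatch between the declared count and the actual length suggests that the URL was edited by hand after the page was saved, which may itself be worth noting.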

Comments like this are a useful way of recording where a page came from. They are especially interesting when they are found in the pages of phishing web sites. Here is an example from a fake U.S. Bank site that shows exactly where the original page is located:

     <!-- saved from url=(0105)http://www.updates-usbank.com/
     internetBanking/RequestRouterRequestCmdId=DisplayLoginPage/
     login_faild.html -->

In some cases, a page may be downloaded from an intermediary web site, rather than the original. A comment line may be the only way to track this information. On occasion you come across a page with more than one comment, like this:

     <!-- saved from url=(0044)http://iqnet.ro/poser/eb/signOutConfirm.html -->
     <!-- saved from url=(0041)http://pages.ebay.com/signOutConfirm.html -->

This is particularly informative as it records the steps that this page has taken in its evolution from the original version. It has been downloaded from ebay.com, uploaded to iqnet.ro (in Romania), downloaded from there, and finally uploaded to the site ebay.arribada-updates.com (located in Mexico), which is where I found it. Although these comment lines are not present in all HTML files, they are well worth looking for.
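When you have many pages to examine, pulling out these comments by hand becomes tedious. The Python sketch below extracts every comment of this form from a page and lists the host names in the order in which the page appears to have traveled. It assumes, as in the example above, that the most recent save is recorded first, so the list is reversed to put the oldest source first. The function name is my own, and comments whose URLs have been wrapped across lines will not match.

```python
import re
from urllib.parse import urlsplit

# Captures the URL from each comment of the form:
#   <!-- saved from url=(NNNN)URL -->
SAVED_FROM = re.compile(r"<!--\s*saved from url=\(\d+\)(\S+?)\s*-->")


def download_history(html):
    """Return the hosts a page was saved from, oldest first.

    Assumes the newest 'saved from' comment appears first in the file.
    """
    urls = SAVED_FROM.findall(html)
    return [urlsplit(url).hostname for url in reversed(urls)]
```

For the two-comment example above, this gives pages.ebay.com followed by iqnet.ro. The site where the page was finally found is not part of the list, because nothing in the file records it.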



Internet Forensics
ISBN: 059610006X
EAN: 2147483647
Year: 2003
Pages: 121
