Automated Source Sifting Techniques

Looking through pages and pages of HTML code for small clues such as those described in this chapter can be tedious and monotonous. Fortunately, we don't have to click View Source on every Web page to look for clues in the HTML source code. HTML source sifting can be automated by using a Web crawler program that creates a copy of the content of a Web site on the local file system. Once all the HTML content is saved locally, we can search it recursively with utilities such as grep or findstr. In this section, we use GNU's wget to create a mirrored copy of a Web site and grep to search it.

Using wget

GNU wget is a noninteractive network retriever, and it can be used to fetch HTML content by recursively crawling a Web site and saving the contents on the local file system in the same hierarchical manner as on the Web site being crawled. You can download wget from the GNU Web site at http://www.gnu.org/.

Instead of repeating what can be learned from the wget documentation, let's look at an example of how wget can be used to automate source sifting. The computer forensics team working on Acme-art.com, Inc.'s case used wget to create a mirrored copy of the HTML content on www.acme-art.com for extensive analysis of the information that can be learned from it. They used wget as follows:

root@blackbox:~/wget# wget -r -m -nv http://www.acme-art.com/
02:27:54 URL:http://www.acme-art.com/ [3558] ->
"www.acme-art.com/index.html" [1]
02:27:54 URL:http://www.acme-art.com/index.cgi?page=falls.shtml [1124] ->
"www.acme-art.com/index.cgi?page=falls.shtml" [1]
02:27:54 URL:http://www.acme-art.com/images/falls.jpg [81279/81279] ->
"www.acme-art.com/images/falls.jpg" [1]
02:27:54 URL:http://www.acme-art.com/images/yf_thumb.jpg [4312/4312] ->
"www.acme-art.com/images/yf_thumb.jpg" [1]
02:27:54 URL:http://www.acme-art.com/index.cgi?page=tahoe1.shtml [1183] ->
"www.acme-art.com/index.cgi?page=tahoe1.shtml" [1]
02:27:54 URL:http://www.acme-art.com/images/tahoe1.jpg [36580/36580] ->
"www.acme-art.com/images/tahoe1.jpg" [1]
02:27:54 URL:http://www.acme-art.com/images/th_thumb.jpg [6912/6912] ->
"www.acme-art.com/images/th_thumb.jpg" [1]
02:27:54 URL:http://www.acme-art.com/index.cgi?page=montrey.shtml [1160] ->
"www.acme-art.com/index.cgi?page=montrey.shtml" [1]
02:27:54 URL:http://www.acme-art.com/images/montrey.jpg [81178/81178] ->
"www.acme-art.com/images/montrey.jpg" [1]
02:27:54 URL:http://www.acme-art.com/images/mn_thumb.jpg [7891/7891] ->
"www.acme-art.com/images/mn_thumb.jpg" [1]
02:27:54 URL:http://www.acme-art.com/index.cgi?page=flower.shtml [1159] ->
"www.acme-art.com/index.cgi?page=flower.shtml" [1]
02:27:55 URL:http://www.acme-art.com/images/flower.jpg [86436/86436] ->
"www.acme-art.com/images/flower.jpg" [1]
02:27:55 URL:http://www.acme-art.com/images/fl_thumb.jpg [8468/8468] ->
"www.acme-art.com/images/fl_thumb.jpg" [1]
02:27:55 URL:http://www.acme-art.com/news/ [2999] ->
"www.acme-art.com/news/index.html" [1]
02:27:55 URL:http://www.acme-art.com/catalogue/ [1031] ->
"www.acme-art.com/catalogue/index.html" [1]
02:27:55 URL:http://www.acme-art.com/catalogue/catalogue.cgi?id=0 [1282] ->
"www.acme-art.com/catalogue/catalogue.cgi?id=0" [1]
02:27:55 URL:http://www.acme-art.com/guestbook/guestbook.html [1343] ->
"www.acme-art.com/guestbook/guestbook.html" [1]
02:27:55 URL:http://www.acme-art.com/guestbook/addguest.html [1302] ->
"www.acme-art.com/guestbook/addguest.html" [1]
02:28:00 URL:http://www.acme-art.com/catalogue/print.cgi [446] ->
"www.acme-art.com/catalogue/print.cgi" [1]
02:28:00 URL:http://www.acme-art.com/catalogue/catalogue.cgi?id=1 [1274] ->
"www.acme-art.com/catalogue/catalogue.cgi?id=1" [1]
02:28:00 URL:http://www.acme-art.com/catalogue/catalogue.cgi?id=2 [1281] ->
"www.acme-art.com/catalogue/catalogue.cgi?id=2" [1]
02:28:00 URL:http://www.acme-art.com/catalogue/catalogue.cgi?id=3 [1282] ->
"www.acme-art.com/catalogue/catalogue.cgi?id=3" [1]
02:28:00 URL:http://www.acme-art.com/news/news.cgi/8_Feb_2001.html [1825] ->
"www.acme-art.com/news/news.cgi/8_Feb_2001.html" [1]
02:28:00 URL:http://www.acme-art.com/news/print.cgi [941] ->
"www.acme-art.com/news/print.cgi" [1]
02:28:00 URL:http://www.acme-art.com/news/news.cgi/12_Apr_2001.html [1884] ->
"www.acme-art.com/news/news.cgi/12_Apr_2001.html" [1]
02:28:01 URL:http://www.acme-art.com/news/news.cgi/14_May_2001.html [1940] ->
"www.acme-art.com/news/news.cgi/14_May_2001.html" [1]
02:28:01 URL:http://www.acme-art.com/news/news.cgi/22_May_2001.html [1870] ->
"www.acme-art.com/news/news.cgi/22_May_2001.html" [1]
02:28:01 URL:http://www.acme-art.com/news/news.cgi/8_Dec_2001.html [1339] ->
"www.acme-art.com/news/news.cgi/8_Dec_2001.html" [1]
FINISHED --02:28:01--
Downloaded: 343,279 bytes in 28 files
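
A note on the options used above: -m (--mirror) already turns on recursive retrieval, so the extra -r is harmless but redundant, and -nv (--no-verbose) produces the one-line-per-file log shown. The following is only a sketch of how the same mirroring step might be wrapped for reuse. The function name mirror_site, the WGET override variable, the default directory name, and the choice of --no-parent and --wait=1 are assumptions made for illustration, not part of the team's procedure.

```shell
# mirror_site URL [DEST_DIR]
# Mirrors URL into DEST_DIR (default: ./mirror) with wget.
# Hypothetical helper: set WGET=echo for a dry run that only prints
# the arguments instead of contacting the site.
mirror_site() {
    url=$1
    dest=${2:-mirror}
    # --mirror       recursive retrieval with time-stamping (implies -r)
    # --no-verbose   one log line per retrieved file
    # --no-parent    do not climb above the starting directory
    # --wait=1       pause one second between requests
    "${WGET:-wget}" --mirror --no-verbose --no-parent --wait=1 \
        --directory-prefix="$dest" "$url"
}
```

Calling mirror_site http://www.acme-art.com/ wget-out would place the mirrored tree under wget-out/www.acme-art.com. Only mirror sites you are authorized to examine.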

The wget retriever created the subdirectory www.acme-art.com and began crawling the Web site from the starting point. As it crawled along, it saved the HTML output in files and created further subdirectories as necessary to preserve the overall site layout and structure. Once it finished, the team quickly analyzed the site structure by running the tree command to display the subdirectory tree below the www.acme-art.com directory:

root@blackbox:~/wget# tree
.
`-- www.acme-art.com
    |-- catalogue
    |   |-- catalogue.cgi?id=0
    |   |-- catalogue.cgi?id=1
    |   |-- catalogue.cgi?id=2
    |   |-- catalogue.cgi?id=3
    |   |-- index.html
    |   `-- print.cgi
    |-- guestbook
    |   |-- addguest.html
    |   `-- guestbook.html
    |-- images
    |   |-- falls.jpg
    |   |-- fl_thumb.jpg
    |   |-- flower.jpg
    |   |-- mn_thumb.jpg
    |   |-- montrey.jpg
    |   |-- tahoe1.jpg
    |   |-- th_thumb.jpg
    |   `-- yf_thumb.jpg
    |-- index.cgi?page=falls.shtml
    |-- index.cgi?page=flower.shtml
    |-- index.cgi?page=montrey.shtml
    |-- index.cgi?page=tahoe1.shtml
    |-- index.html
    `-- news
        |-- index.html
        |-- news.cgi
        |   |-- 12_Apr_2001.html
        |   |-- 14_May_2001.html
        |   |-- 22_May_2001.html
        |   |-- 8_Dec_2001.html
        |   `-- 8_Feb_2001.html
        `-- print.cgi

6 directories, 28 files

Each file saved in these directories contains HTML output from the corresponding files or scripts on http://www.acme-art.com. Now we are set to search for clues in the HTML source code.
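
Because wget keeps the query string in the saved file name, the mirror also records which scripts the crawler reached and with which parameters. As a small sketch (the function name list_dynamic is made up for this example), the files that came from parameterized requests can be listed like this:

```shell
# list_dynamic DIR
# Prints every mirrored file under DIR whose name contains a literal "?",
# that is, a page wget fetched through a URL with a query string.
list_dynamic() {
    # \? makes find match a literal question mark rather than "any character"
    find "$1" -type f -name '*\?*' | sort
}
```

Piping the result through sed 's/?.*//' | sort -u would reduce the list to the distinct script names.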

Using grep

The quickest way of looking through the mirrored HTML code is to use grep to look for patterns. Windows users have a similar utility, findstr, at their disposal. The following list shows how grep is used to sift through the HTML code for clues that we've discussed:

Elements                      Pattern         grep syntax
HTML comments                 <!-- -->        grep -r '<!--' *
Internal/external hyperlinks  HREF, ACTION    grep -r -i -E 'href=|action=' *
E-mail addresses              @               grep -r '@' *
Keywords/meta tags            <META           grep -r -i '<meta' *
Hidden fields                 TYPE=HIDDEN     grep -r -i 'type=hidden' *
Client-side scripts           <SCRIPT         grep -r -i '<script' *
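
The individual searches in the table above can also be run in one pass. The sketch below is one possible way to do that. The function names sift and sift_mirror and the section labels are invented for this example, and the hidden-field pattern is loosened slightly (an optional double quote after type=) on the assumption that some pages quote their attribute values.

```shell
# sift LABEL PATTERN DIR
# Prints a heading, then every line under DIR that matches PATTERN
# (case-insensitive, extended regex, with file name and line number).
sift() {
    printf '== %s ==\n' "$1"
    # grep exits non-zero when nothing matches; that is not an error here
    grep -r -i -n -E -e "$2" "$3" || true
}

# sift_mirror DIR
# Runs all six searches against a mirrored site.
sift_mirror() {
    sift 'HTML comments'       '<!--'           "$1"
    sift 'Hyperlinks'          'href=|action='  "$1"
    sift 'E-mail addresses'    '@'              "$1"
    sift 'Meta tags'           '<meta'          "$1"
    sift 'Hidden fields'       'type="?hidden'  "$1"
    sift 'Client-side scripts' '<script'        "$1"
}
```

For example, sift_mirror www.acme-art.com > clues.txt collects everything in a single file for later review.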

Here is what the output from grep looks like:

root@blackbox:~/wget/www.acme-art.com# grep -r -i 'hidden' *
index.cgi?page=falls.shtml:<INPUT TYPE=HIDDEN NAME=_INIFILE VALUE="cart.ini">
index.cgi?page=falls.shtml:<INPUT TYPE=HIDDEN NAME=_ACTION VALUE="ADD">
index.cgi?page=falls.shtml:<INPUT TYPE=HIDDEN NAME=_PCODE VALUE="88-001">
index.cgi?page=tahoe1.shtml:<INPUT TYPE=HIDDEN NAME=_INIFILE VALUE="cart.ini">
index.cgi?page=tahoe1.shtml:<INPUT TYPE=HIDDEN NAME=_ACTION VALUE="ADD">
index.cgi?page=tahoe1.shtml:<INPUT TYPE=HIDDEN NAME=_PCODE VALUE="88-002">
index.cgi?page=montrey.shtml:<INPUT TYPE=HIDDEN NAME=_INIFILE VALUE="cart.ini">
index.cgi?page=montrey.shtml:<INPUT TYPE=HIDDEN NAME=_ACTION VALUE="ADD">
index.cgi?page=montrey.shtml:<INPUT TYPE=HIDDEN NAME=_PCODE VALUE="88-003">
index.cgi?page=flower.shtml:<INPUT TYPE=HIDDEN NAME=_INIFILE VALUE="cart.ini">
index.cgi?page=flower.shtml:<INPUT TYPE=HIDDEN NAME=_ACTION VALUE="ADD">
index.cgi?page=flower.shtml:<INPUT TYPE=HIDDEN NAME=_PCODE VALUE="88-001">

Note how cart.ini is displayed along with the hidden field _INIFILE. If this simple test for information leakage had been performed before www.acme-art.com went online, the hack attempt probably would have been foiled.
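
On a larger site, the raw grep output can run to hundreds of lines, and a summary of distinct hidden fields is easier to scan for values such as file names. The following is a rough sketch of such a summary. The function name hidden_fields is hypothetical, the script relies on the -o option of GNU and BSD grep, and it assumes that NAME appears before VALUE inside each tag, as in the output above. Tags that do not fit that assumption are printed unchanged.

```shell
# hidden_fields DIR
# Prints each distinct NAME=VALUE pair from hidden <INPUT> tags under DIR,
# preceded by the number of times it occurs.
hidden_fields() {
    # 1. pull out only the hidden <INPUT ...> tags, without file names
    # 2. reduce each tag to NAME=VALUE (quotes around values are optional)
    # 3. count identical pairs
    grep -r -i -h -o -E '<input[^>]*type="?hidden[^>]*>' "$1" |
        sed -E 's/.*[Nn][Aa][Mm][Ee]="?([^" >]*)"?.*[Vv][Aa][Ll][Uu][Ee]="?([^" >]*)"?.*/\1=\2/' |
        sort | uniq -c
}
```

Run against the mirror, a pair such as _INIFILE=cart.ini would show up once in the summary with its occurrence count, which makes a leaked file name hard to overlook.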

 



Web Hacking: Attacks and Defense
ISBN: 0201761769
Year: 2005
Pages: 156
