7.8. Web Page Indexing

Over the past ten years, the Internet has grown to such dimensions that it has become impossible to find anything on it without a good search system. The first search systems simply indexed Internet pages by their contents and then used the resulting database for searches, which produced rough matches. Most languages contain words with two or even more meanings, which makes searching by such words difficult.

The problem lies not only in words with numerous meanings. There are many commonly used expressions that are difficult to apply when conducting a search. These factors forced search systems to develop better search algorithms, and now a search can be requested based on a combination of various parameters. One of today's most powerful search systems is Google (www.google.com). It offers many options for making a search more precise. Unfortunately, most users have not mastered these options, but hackers have, and they use them for nefarious purposes.

One of the simplest ways to use a search system for breaking into a server is to use it to find a closed Web page. Some sites have areas that can be accessed only with a password. Such sites include paid resources whose protection consists only of checking the password when entering the system; individual pages are not protected, and SSL is not used. In this case, Google can index the pages of such closed sites, and they can then be viewed through the search system. You just need an exact idea of what information the page contains and a search query composed as precisely as possible.

Google can be helpful in unearthing quite important information not intended for public viewing, which becomes accessible to the Google indexing engine because of a mistake by the administrator. For the search to be successful, you need to specify the correct parameters. For example, entering Annual report filetype:doc into the search line will return all Word documents containing the words "annual report."

Most likely, the number of documents found will be too great, and you will have to narrow the search criteria. Persevere and you'll succeed. There are real-life examples in which confidential data, including valid credit card numbers and financial accounts, were obtained using this simple method.
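The criteria can be narrowed by combining several operators: a phrase in quotation marks forces an exact match, filetype: restricts the document format, and site: limits the search to a single domain. The queries below are only illustrations; the domain and phrases are placeholders, not taken from a real case:

 "annual report" filetype:xls site:example.com
 "for internal use only" filetype:pdf site:example.com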

Consider how indexing of Web pages that are not supposed to be open to the public can be disallowed. For this, you have to understand what search systems index. The answer is simple: They index everything they come across: text, names, picture names, documents in various formats (PDF, XLS, DOC, etc.), and so on.

Your task is to limit the search robots' doggedness so that they do not index the material you don't want them to. This is done by sending the robot a certain signal. How is this done? The solution is simple yet elegant: A file named robots.txt, containing rules for search robots to follow, is placed in the site's root.

Suppose that a robot is about to index the www.your_name.com site. Before it starts doing this, the robot will try to load the www.your_name.com/robots.txt file. If it succeeds, it will index the site following the rules described in the file; otherwise, the contents of the entire site will be indexed.
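This check can be emulated with Python's standard urllib.robotparser module. The following is only a sketch of the protocol, not any particular robot's code; the host name is the placeholder used in this section, so substitute a real site to actually run it. The parser treats a robots.txt file that cannot be found as permission to index everything, matching the behavior described above.

# Emulate the check a search robot performs before indexing a page.
# www.your_name.com is the placeholder from the text, not a real host.
from urllib.robotparser import RobotFileParser

SITE = "http://www.your_name.com"

robots = RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()  # a missing robots.txt means "no restrictions"

page = SITE + "/admin/index.html"
if robots.can_fetch("*", page):
    print(page, "may be indexed")
else:
    print(page, "is excluded by robots.txt")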

The format of the file is simple: It uses only two directives:

  • User-Agent: parameter. The value of parameter is the name of the search system covered by the prohibition. There can be more than one such entry in the file, each describing an individual search system. If the prohibitions apply to all search systems, the value of parameter is set to *.

  • Disallow: address. This prohibits indexing of the indicated address, specified relative to the URL. For example, indexing of pages from www.your_name.com/admin is prohibited by setting address to /admin/. The address is specified relative to the URL and not relative to the file system, because the search system cannot know the location of files on the server's disk and operates only with URL addresses.

The following is an example of a robots.txt file that prohibits all search robots from indexing pages located at the URLs www.your_name.com/admin and www.your_name.com/cgi-bin:

 User-Agent: *
 Disallow: /cgi-bin/
 Disallow: /admin/

The prohibitions set by the preceding rules also apply to subdirectories of the specified directories. Thus, files located at www.your_name.com/cgi-bin/forum will not be indexed. The following example prohibits the entire site from being indexed:

 User-Agent: *
 Disallow: /
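You can check how such rules will be interpreted without putting anything on a server: the same urllib.robotparser module accepts the rules directly. This is just a sketch to confirm the behavior described above; the rules and paths are taken from the first example.

# Feed the example rules straight to the parser and test a few URLs.
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

robots = RobotFileParser()
robots.parse(rules.splitlines())

# A subdirectory of a disallowed directory is also excluded.
print(robots.can_fetch("*", "http://www.your_name.com/cgi-bin/forum/"))    # False
print(robots.can_fetch("*", "http://www.your_name.com/admin/users.html"))  # False
print(robots.can_fetch("*", "http://www.your_name.com/index.html"))        # True

Replacing the rules string with the second example (Disallow: /) makes can_fetch() return False for every page on the site.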

If your site contains a directory with confidential data, you should disallow its indexing. But you should not get carried away and prohibit indexing altogether; this would keep the site out of search results, and you stand to lose potential visitors. According to statistics, the number of visitors directed to sites by search engines is greater than the number arriving from anywhere else.


