
Recipe 10.4. Calculating Apache Hits per IP Address

Credit: Mark Nenadov, Ivo Woltring

Problem

You need to examine an Apache log file to count the number of hits recorded from each individual IP address that accessed the server.

Solution

Many of the chores of administering a web server have to do with analyzing Apache logs, which Python makes easy:

def calculateApacheIpHits(logfile_pathname):
    ''' return a dict mapping IP addresses to hit counts '''
    ipHitListing = {}
    contents = open(logfile_pathname, "r")
    # go through each line of the logfile
    for line in contents:
        # split the string to isolate the IP address
        ip = line.split(" ", 1)[0]
        # Ensure length of the IP address is proper (see discussion)
        if 6 < len(ip) <= 15:
            # Increase by 1 if IP exists; else set hit count = 1
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1
    return ipHitListing

Discussion

This recipe supplies a function that returns a dictionary containing the hit counts for each individual IP address that has accessed your Apache web server, as recorded in an Apache log file. For example, a typical use would be:

HitsDictionary = calculateApacheIpHits(
    "/usr/local/nusphere/apache/logs/access_log")

This function has many quite useful applications. For example, I often use it in my code to determine the number of hits that are actually originating from locations other than my local host. This function is also used to chart which IP addresses are most actively viewing the pages that are served by a particular installation of Apache.
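As a rough sketch of the two uses just mentioned, the helper below takes a dictionary of the kind calculateApacheIpHits returns, drops local addresses, and lists the busiest remote addresses first. The name top_remote_hits, the default set of addresses treated as local, and the sample data are all assumptions made for illustration; they are not part of the recipe.

```python
def top_remote_hits(hits, local_ips=("127.0.0.1",), limit=10):
    """Return up to `limit` (ip, count) pairs, busiest first, skipping local IPs."""
    remote = [(ip, count) for ip, count in hits.items() if ip not in local_ips]
    # sort by descending count, then by IP string so ties come out in a fixed order
    remote.sort(key=lambda pair: (-pair[1], pair[0]))
    return remote[:limit]

# hypothetical data in the shape calculateApacheIpHits returns
sample_hits = {"127.0.0.1": 42, "192.0.2.7": 5, "198.51.100.23": 17}
for ip, count in top_remote_hits(sample_hits):
    print("%-15s %d" % (ip, count))
```

If your own machine reaches the server through other addresses (a LAN address, for instance), pass them in local_ips as well.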

This function performs a modest validation of each IP address, which is really just a length check: an IP address cannot be longer than 15 characters (four groups of three digits and 3 periods) nor shorter than 7 (four single digits and 3 periods). This validation is not stringent, but it does reduce, at tiny runtime cost, the probability of placing obvious garbage into the dictionary. As a general technique, low-cost, highly approximate sanity checks for data that is expected to be OK (but one never knows for sure) are worth considering. However, if you want to be stricter, regular expressions can help. Replace the loop in the body of this recipe's function with:

    import re
    # an IP is: 4 strings, each of 1-3 digits, joined by periods
    ip_specs = r'\.'.join([r'\d{1,3}']*4)
    re_ip = re.compile(ip_specs)
    for line in contents:
        match = re_ip.match(line)
        if match:
            # Increase by 1 if IP exists; else set hit count = 1
            ip = match.group()
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

In this variant, we use a regular expression to extract and validate the IP at the same time. This approach lets us drop both the split operation and the length check, which offsets most of the runtime cost of matching the regular expression. This variant is only a few percent slower than the recipe's solution.
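To check the relative cost on your own machine, the two approaches can be compared with the standard timeit module. This is only a sketch: the function names count_by_split and count_by_regex and the synthetic log lines are assumptions made for illustration, and the timings will vary with the data, the hardware, and the Python version.

```python
import re
import timeit

RE_IP = re.compile(r'\.'.join([r'\d{1,3}'] * 4))

def count_by_split(lines):
    """Count hits using split plus the length check from the Solution."""
    hits = {}
    for line in lines:
        ip = line.split(" ", 1)[0]
        if 6 < len(ip) <= 15:
            hits[ip] = hits.get(ip, 0) + 1
    return hits

def count_by_regex(lines):
    """Count hits using the regular expression variant."""
    hits = {}
    for line in lines:
        match = RE_IP.match(line)
        if match:
            ip = match.group()
            hits[ip] = hits.get(ip, 0) + 1
    return hits

# synthetic lines, loosely shaped like Apache's common log format (assumed data)
lines = ['192.0.2.%d - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326'
         % (i % 250) for i in range(10000)]

for func in (count_by_split, count_by_regex):
    seconds = timeit.timeit(lambda: func(lines), number=5)
    print("%s: %.3f s" % (func.__name__, seconds))
```

Both functions read from any iterable of lines, so an open log file can be passed in place of the synthetic list.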

Of course, the pattern given here as ip_specs is not entirely precise either, since it accepts, as components of an IP quad, arbitrary strings of one to three digits, while the components should be more constrained. But to ward off garbage lines, this level of sanity check is sufficient.
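If you do want the components constrained, one possible tightening is sketched below. It limits each component to the range 0 to 255 and refuses a match that is immediately followed by another digit. The names OCTET and STRICT_IP are assumptions made for illustration, and rejecting leading zeros (such as 01) is a choice of this sketch rather than something the recipe specifies.

```python
import re

# one component: 250-255, 200-249, 100-199, or 0-99 without a leading zero
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'

# four components joined by periods; the trailing lookahead refuses a digit
# right after the last component, so "1.2.3.4" is not matched inside "1.2.3.456".
# re.ASCII restricts \d to the digits 0-9.
STRICT_IP = re.compile(r'\.'.join([OCTET] * 4) + r'(?!\d)', re.ASCII)

for candidate in ("192.0.2.7 - - [...]", "256.1.1.1 - -", "1.2.3.456 - -"):
    match = STRICT_IP.match(candidate)
    print(candidate, "->", match.group() if match else None)
```

STRICT_IP can be used in place of re_ip in the loop shown earlier, at the price of a more complicated pattern.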

Another alternative is to convert and check the address: extract string ip just as we do in this recipe's Solution, then:

        # Ensure the IP address is proper
        try:
            # build a real list (not a lazy map object) so len() works below
            quad = [int(part) for part in ip.split('.')]
        except ValueError:
            pass
        else:
            if len(quad)==4 and min(quad)>=0 and max(quad)<=255:
                # Increase by 1 if IP exists; else set hit count = 1
                ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

This approach is more work, but it ensures that an address is counted only if it consists of four integers, each between 0 and 255.
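On Python 3.3 and later, a different technique from the manual conversion above is to hand the check to the standard library's ipaddress module. The sketch below is one way to do that; the names is_valid_ipv4 and count_valid_hits are assumptions made for illustration, and the exact set of strings ipaddress accepts (leading zeros, for example) depends on the Python version.

```python
import ipaddress

def is_valid_ipv4(text):
    """Return True if the ipaddress module accepts text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError, raised for malformed input, subclasses ValueError
        return False
    return True

def count_valid_hits(lines):
    """Count hits per IP, keeping only addresses that pass is_valid_ipv4."""
    hits = {}
    for line in lines:
        # isolate the first space-separated field, as in the Solution
        ip = line.split(" ", 1)[0]
        if is_valid_ipv4(ip):
            hits[ip] = hits.get(ip, 0) + 1
    return hits

# assumed sample lines; the second starts with an out-of-range component
sample_lines = ["192.0.2.7 - - [...]", "999.0.2.7 - - [...]", "192.0.2.7 - - [...]"]
print(count_valid_hits(sample_lines))
```

This is likely to cost more per line than the length check, so it is best kept for cases where strictness matters more than speed.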
