#85 Understanding Search Engine Traffic

Script #84, Exploring the Apache access_log, can offer a broad-level overview of some of the search engine queries that point to your site, but further analysis can reveal not just which search engines are delivering traffic, but also what keywords were entered by users who arrived at your site via those search engines. This information can be invaluable for understanding whether your site has been properly indexed by the search engines, and it can provide the starting point for improving the rank and relevancy of your search engine listings.
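For reference, here is what one matching log entry might look like. It's a hypothetical access_log line in the format the script expects (the Common Log Format extended with referrer and user agent fields); the quoted referrer, which awk sees as field $11, carries both the search engine's hostname and the user's query string:

 192.0.2.17 - - [11/Feb/2004:09:23:11 -0700] "GET /custer/index.html HTTP/1.1" 200 8532 "http://www.google.com/search?q=little+big+horn" "Mozilla/4.0 (compatible)"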

The Code

 #!/bin/sh

 # searchinfo - Extracts and analyzes search engine traffic indicated in the
 #    referrer field of a Common Log Format access log.

 host="intuitive.com"    # change to your domain, as desired
 maxmatches=20
 count=0
 temp="/tmp/$(basename $0).$$"

 trap "/bin/rm -f $temp" 0

 if [ $# -eq 0 ] ; then
   echo "Usage: $(basename $0) logfile" >&2
   exit 1
 fi

 if [ ! -r "$1" ] ; then
   echo "Error: can't open file $1 for analysis." >&2
   exit 1
 fi

 for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
   grep -vE "(/www.$host|/$host)" | grep '?')
 do
   searchengine="$(echo $URL | cut -d/ -f3 | rev | cut -d. -f1-2 | rev)"
   args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
      grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
      sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2)"

   if [ ! -z "$args" ] ; then
     echo "${searchengine}:      $args" >> $temp
   else
     # No well-known match, show entire GET string instead...
     echo "${searchengine}       $(echo $URL | cut -d\? -f2)" >> $temp
   fi
   count="$(( $count + 1 ))"
 done

 echo "Search engine referrer info extracted from ${1}:"

 sort $temp | uniq -c | sort -rn | head -$maxmatches | sed 's/^/ /g'

 echo ""
 echo Scanned $count entries in log file out of $(wc -l < "$1") total.

 exit 0

How It Works

The main for loop of this script extracts all entries in the log file that have a valid referrer with a string length greater than 4, a referrer domain that does not match the $host variable, and a ? in the referrer string (indicating that a user search was performed):

 for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
   grep -vE "(/www.$host|/$host)" | grep '?')
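
If you want to see what survives this filter before running the whole script, the stage can be exercised on its own. The following is just a sketch; /tmp/sample_log is a placeholder for whatever log file you would pass as the first argument:

 host="intuitive.com"

 # Print the referrer ($11) of every entry whose referrer is longer than four
 # characters, then drop self-referrals from $host and keep only referrers
 # that contain a "?" (the mark of a query string).
 awk '{ if (length($11) > 4) { print $11 } }' /tmp/sample_log | \
   grep -vE "(/www.$host|/$host)" | grep '?'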

The script then goes through various steps in the ensuing lines to identify the domain name of the referrer and the search value entered by the user:

 searchengine="$(echo $URL | cut -d/ -f3 | rev | cut -d. -f1-2 | rev)"
 args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
    grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
    sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2)"

An examination of hundreds of search queries shows that the common search sites use a small set of variable names for the query string. For example, search on Yahoo.com and your search string appears as p=pattern; Google and MSN use q as the search variable name. The grep invocation matches p, q, and the other most common search variable names.

The last line, the invocation of sed, cleans up the resultant search patterns, replacing + and %20 sequences with spaces and stripping out quotes, and then the cut command returns everything that occurs after the first equal sign (=); in other words, just the search terms.
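
To see these two pipelines in action, you can feed them a single referrer by hand. The URL below is made up, but the commands are the same ones the script runs:

 URL='http://www.google.com/search?q=cool+web+pages&hl=en'   # hypothetical referrer

 # Domain: take the third /-separated field, then keep its last two
 # dot-separated components (www.google.com becomes google.com).
 echo $URL | cut -d/ -f3 | rev | cut -d. -f1-2 | rev
 # google.com

 # Search terms: grab the query string, split it on '&', keep the line that
 # starts with a known search variable, clean it up, and drop the "q=".
 echo $URL | cut -d\? -f2 | tr '&' '\n' | \
   grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
   sed -e 's/+/ /g' -e 's/%20/ /g' -e 's/"//g' | cut -d= -f2
 # cool web pages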

The conditional immediately following these lines tests whether the args variable is empty. If it is (that is, if the query string doesn't use one of the known variable names), the referrer is from a search engine we haven't seen before, so the script outputs the entire GET string rather than a cleaned-up, pattern-only value.

Running the Script

To run this script, simply specify the name of an Apache or other Common Log Format log file on the command line.

Speed warning!  

This is one of the slowest scripts in this book because it spawns lots and lots of subshells to perform its various tasks, so don't be surprised if it takes a while to run.
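
Most of that time goes into the handful of processes launched for every matching log entry. If the wait becomes annoying, one option (a sketch, not part of the original script) is to collapse a pipeline into a single awk call; for example, the domain extraction could be rewritten as:

 # One awk process instead of a chain of cut and rev calls: take the hostname
 # (the third /-separated field) and keep its last two dot-separated parts.
 searchengine="$(echo $URL | awk -F/ '{ n = split($3, p, "."); print p[n-1] "." p[n] }')"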

The Results

 $ searchinfo /web/logs/intuitive/access_log
 Search engine referrer info extracted from /web/logs/intuitive/access_log:
     19 msn.com:      little big horn
     14 msn.com:      custer
     11 google.com:      cool web pages
     10 msn.com:      plains
      9 msn.com:      Little Big Horn
      9 google.com:      html 4 entities
      6 msn.com:      Custer
      4 msn.com:      the plains indians
      4 msn.com:      little big horn battlefield
      4 msn.com:      Indian Wars
      4 google.com:      newsgroups
      3 yahoo.com:      cool web pages
      3 ittoolbox.com       i=1186"
      3 google.it:      jungle book kipling plot
      3 google.com:      cool web graphics
      3 google.com:      colored bullets CSS
      2 yahoo.com:      unix%2Bhogs
      2 yahoo.com:      cool HTML tags
      2 msn.com:      www.custer.com

 Scanned 466 entries in log file out of 11406 total.

Hacking the Script

You can tweak this script in a variety of ways to make it more useful. One obvious tweak is to skip the referrer URLs that are (most likely) not from search engines. To do so, simply comment out the else clause in the following passage:

 if [ ! -z "$args" ] ; then
   echo "${searchengine}:      $args" >> $temp
 else
   # No well-known match, show entire GET string instead...
   echo "${searchengine}       $(echo $URL | cut -d\? -f2)" >> $temp
 fi
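
With both the else keyword and the echo beneath it commented out (an else with an empty body is a shell syntax error), the conditional would look roughly like this:

 if [ ! -z "$args" ] ; then
   echo "${searchengine}:      $args" >> $temp
 # else
 #   # No well-known match, so this referrer is probably not a search engine; skip it.
 #   echo "${searchengine}       $(echo $URL | cut -d\? -f2)" >> $temp
 fi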

To be fair, ex post facto analysis of search engine traffic is difficult. Another way to approach this task would be to search for all hits coming from a specific search engine, entered as the second command argument, and then to compare the search strings specified. The core for loop would change, but, other than a slight tweak to the usage message, the script would be identical to the searchinfo script:

 for URL in $(awk '{ if (length($11) > 4) { print $11 } }' "$1" | \
   grep $2)
 do
   args="$(echo $URL | cut -d\? -f2 | tr '&' '\n' | \
      grep -E '(^q=|^sid=|^p=|query=|item=|ask=|name=|topic=)' | \
      cut -d= -f2)"
   echo $args | sed -e 's/+/ /g' -e 's/"//g' >> $temp
   count="$(($count + 1))"
 done

The results of this new version, given google.com as an argument, are as follows:

 $ enginehits /web/logs/intuitive/access_log google.com
 Search engine referrer info extracted google searches from /web/logs/intuitive/access_log:
     13 cool web pages
     10
      9 html 4 entities
      4 newsgroups
      3 solaris 9
      3 jungle book kipling plot
      3 intuitive
      3 cool web graphics
      3 colored bullets CSS
      2 sun solaris operating system reading material
      2 solaris unix
      2 military weaponry
      2 how to add program to sun solaris menu
      2 dynamic html border
      2 Wallpaper Nikon
      2 HTML for heart symbol
      2 Cool web pages
      2 %22Military weaponry%22
      1 www%2fvoices.com
      1 worst garage door opener
      1 whatis artsd
      1 what%27s meta tag

 Scanned 232 google entries in log file out of 11481 total.

If most of your traffic comes from a few search engines, you could analyze those engines separately and then list all traffic from other search engines at the end of the output.
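
As a rough sketch of that idea, built on the enginehits variation above (the engine names and log path here are just examples), you could wrap the two scripts like this:

 #!/bin/sh
 # topengines - hypothetical wrapper: per-engine reports for the big referrers,
 # then everything else lumped together at the end.
 log="/web/logs/intuitive/access_log"

 for engine in google.com msn.com yahoo.com
 do
   echo "===== $engine ====="
   enginehits "$log" "$engine"
   echo ""
 done

 echo "===== all other search engine referrers ====="
 searchinfo "$log" | grep -vE '(google\.com|msn\.com|yahoo\.com)'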



