Recipe 4.11 Program: Data Mining

Suppose that I, as a published author, want to track how my book is selling in comparison to others. This information can be obtained for free just by clicking on the page for my book on any of the major bookseller sites, reading the sales rank number off the screen, and typing the number into a file, but that's too tedious. As I wrote in the book that this example looks for, "computers get paid to extract relevant information from files; people should not have to do such mundane tasks."

This program uses the Regular Expressions API and, in particular, newline matching to extract a value from an HTML page on the hypothetical QuickBookShop.web site. It also reads from a URL object (see Recipe 18.7). The pattern to look for is something like this (bear in mind that the HTML may change at any time, so I want to keep the pattern fairly general):

<b>QuickBookShop.web Sales Rank: </b> 26,252 </font><br>
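A pattern loose enough for this fragment can be sketched as follows. This is only an illustration under stated assumptions: the class name RankPatternDemo and the pattern itself are hypothetical, not the pattern the program is configured with. It skips any whitespace and tags after "Rank:" and captures digits and commas. No special flags are needed, because both \s and a negated character class such as [^>] match line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull a sales rank out of HTML that may be split across lines. */
class RankPatternDemo {

    // Hypothetical pattern: "Rank:", then optional whitespace and tags,
    // then one capture group holding the digits (and any commas).
    static final Pattern RANK =
        Pattern.compile("Rank:\\s*(?:<[^>]*>\\s*)*([\\d,]+)");

    /** Returns the rank with commas removed, or null if nothing matched. */
    static String extractRank(String html) {
        Matcher m = RANK.matcher(html);
        return m.find() ? m.group(1).replace(",", "") : null;
    }

    public static void main(String[] args) {
        // The sample fragment, split over three lines.
        String html = "<b>QuickBookShop.web Sales Rank: </b>\n"
                    + "26,252\n"
                    + "</font><br>";
        System.out.println(extractRank(html));   // prints 26252
    }
}
```

Keeping a single capture group means the rest of the program can always ask for group 1, whatever the surrounding markup turns out to be.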

As the pattern may extend over more than one line, I read the entire web page from the URL into a single long string using my FileIO.readerToString( ) method (see Recipe 10.8) instead of the more traditional line-at-a-time paradigm. I then plot a graph using an external program (see Recipe 26.1); this could (and should) be changed to use a Java graphics program (see Recipe 13.13 for some leads). The complete program is shown in Example 4-8.
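FileIO.readerToString( ) itself is described in Recipe 10.8. For readers without that recipe at hand, a rough stand-in might look like the sketch below. The class name ReaderSlurp and the buffer size are assumptions made for illustration; this is not the author's implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Sketch: read an entire Reader into one String, line breaks included. */
class ReaderSlurp {

    static String readerToString(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];          // arbitrary buffer size
        int n;
        // read( ) returns -1 at end of stream
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = readerToString(
            new StringReader("Sales Rank: </b>\n26,252\n"));
        // The newlines survive, so a regex can match across them.
        System.out.println(text.contains("\n"));   // prints true
    }
}
```

Reading by character buffer rather than by line is what preserves the line terminators; a readLine( ) loop would strip them.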

Example 4-8. BookRank.java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.darwinsys.util.FileIO;  // package assumed; FileIO is described in Recipe 10.8
import com.darwinsys.util.FileProperties;

/** Graph of a book's sales rank on a given bookshop site.
 * @author Ian F. Darwin, Java Cookbook author,
 *    originally translated fairly literally from Perl into Java.
 * @author Patrick Killelea <>: original Perl version,
 *    from the 2nd edition of his book "Web Performance Tuning".
 * @version $Id: ch04.xml,v 1.4 2004/05/04 20:11:27 ian Exp $
 */
public class BookRank {
    public final static String DATA_FILE = "book.sales";
    public final static String GRAPH_FILE = "book.png";

    /** Grab the sales rank off the web page and log it. */
    public static void main(String[] args) throws Exception {

        // The properties file name is the first command-line argument, if any.
        Properties p = new FileProperties(
            args.length == 0 ? "" : args[0]);
        String title = p.getProperty("title", "NO TITLE IN PROPERTIES");
        // The url must have the "isbn=" at the very end, or otherwise
        // be amenable to being string-catted to, like the default.
        String url = p.getProperty("url", "");
        // The 10-digit ISBN for the book.
        String isbn  = p.getProperty("isbn", "0000000000");
        // The regex pattern (MUST have ONE capture group for the number)
        String pattern = p.getProperty("pattern", "Rank: (\\d+)");

        // Looking for something like this in the input:
        //     <b>QuickBookShop.web Sales Rank: </b>
        //     26,252
        //     </font><br>
        Pattern r = Pattern.compile(pattern);

        // Open the URL and get a Reader from it.
        BufferedReader is = new BufferedReader(new InputStreamReader(
            new URL(url + isbn).openStream( )));
        // Read the URL looking for the rank information, as
        // a single long string, so can match regex across multi-lines.
        String input = FileIO.readerToString(is);
        // System.out.println(input);

        // If found, append to sales data file.
        Matcher m = r.matcher(input);
        if (m.find( )) {
            PrintWriter pw = new PrintWriter(
                new FileWriter(DATA_FILE, true));
            // Same as `date +'%m %d %H %M %S %Y'`; note HH (24-hour clock)
            // so that the value agrees with gnuplot's %H below.
            String date =
                new SimpleDateFormat("MM dd HH mm ss yyyy ").
                format(new Date( ));
            // Paren 1 is the digits (and maybe ','s) that matched; remove comma
            Matcher noComma = Pattern.compile(",").matcher(m.group(1));
            pw.println(date + noComma.replaceAll(""));
            pw.close( );
        } else {
            System.err.println("WARNING: pattern `" + pattern +
                "' did not match in `" + url + isbn + "'!");
        }

        // Whether current data found or not, draw the graph, using
        // external plotting program against all historical data.
        // Could use gnuplot, R, any other math/graph program.
        // Better yet: use one of the Java plotting APIs.
        String gnuplot_cmd =
            "set term png\n" +
            "set output \"" + GRAPH_FILE + "\"\n" +
            "set xdata time\n" +
            "set ylabel \"Book sales rank\"\n" +
            "set bmargin 3\n" +
            "set logscale y\n" +
            "set yrange [1:60000] reverse\n" +
            "set timefmt \"%m %d %H %M %S %Y\"\n" +
            "plot \"" + DATA_FILE +
                "\" using 1:7 title \"" + title + "\" with lines\n"
        ;

        // Feed the commands to gnuplot on its standard input.
        Process proc = Runtime.getRuntime( ).exec("/usr/local/bin/gnuplot");
        PrintWriter gp = new PrintWriter(proc.getOutputStream( ));
        gp.print(gnuplot_cmd);
        gp.close( );
    }
}
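Each successful run appends one line to book.sales: six date fields followed by the rank, which is why the gnuplot command plots column 7 against a time read from the first six columns. As a possible first step toward replacing gnuplot with a Java plotting API, such a line could be read back as sketched below. The SalesRecord class and its field names are hypothetical and not part of the original program.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

/** Sketch: one line of the sales data file, e.g. "05 04 20 11 27 2004 26252". */
class SalesRecord {
    final Date when;
    final int rank;

    SalesRecord(Date when, int rank) {
        this.when = when;
        this.rank = rank;
    }

    /** Parse a line written as six date fields followed by the rank. */
    static SalesRecord parse(String line) throws ParseException {
        String trimmed = line.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            throw new ParseException("No rank field in: " + line, 0);
        }
        // Same field order as the gnuplot timefmt "%m %d %H %M %S %Y".
        SimpleDateFormat fmt = new SimpleDateFormat("MM dd HH mm ss yyyy");
        Date when = fmt.parse(trimmed.substring(0, lastSpace));
        int rank = Integer.parseInt(trimmed.substring(lastSpace + 1));
        return new SalesRecord(when, rank);
    }

    public static void main(String[] args) throws ParseException {
        SalesRecord r = parse("05 04 20 11 27 2004 26252");
        System.out.println(r.when + " -> " + r.rank);
    }
}
```

A list of such records is the kind of input most plotting libraries expect: a timestamp for the x axis and a number for the y axis.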

Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin

