Hack 40. Calculate Rank
The RANK( ) function introduced in ISO SQL:2003 has been implemented in Oracle and SQL Server. RANK( ) allows the efficient calculation of rank position. Suppose you have the results of the sales figures achieved by your sales force, as shown in Table 5-23. Table 5-23. The sales table
You can use the ORDER BY clause to see these by value or by volume:
SQL>
SELECT * FROM sales ORDER BY totValue DESC;
ID NAME TOTVALUE TOTVOLUME
---------- -------------------- ---------- ----------
1 Loman 4000 49
5 Mitchell 4000 96
2 Miller 3000 49
3 Hoffman 3000 85
4 Cobb 2000 66
With RANK( ) you can return the rank position for any column: SQL> SELECT name, 2 totValue, 3 RANK( ) OVER (ORDER BY totValue DESC) r 4 FROM sales 5 ORDER BY totValue DESC; NAME TOTVALUE R -------------------- ---------- ---------- Loman 4000 1 Mitchell 4000 1 Miller 3000 3 Hoffman 3000 3 Cobb 2000 5 Better yet, you can show the rank position for more than one column at a time and you can order the result any way you choose: SQL> SELECT name, 2 totValue, 3 RANK( ) OVER (ORDER BY totValue DESC) rVal, 4 totVolume, 5 RANK( ) OVER (ORDER BY totVolume DESC) rVol 6 FROM sales 7 ORDER BY name; NAME TOTVALUE RVAL TOTVOLUME RVOL ---------- ---------- ---------- ---------- ---------- Cobb 2000 5 66 3 Hoffman 3000 3 85 2 Loman 4000 1 49 4 Miller 3000 3 49 4 Mitchell 4000 1 96 1
You can see that Mitchell and Loman are tied for top salesman by value, but Mitchell is also the top
|
Chapter 6. Online Applications
An SQL database provides a powerful solution to many problems that arise in web-based applications. In addition to storing text and image data, you can use SQL to assist in building
Despite these benefits, however, it can be dangerous to use SQL in online applications. This chapter includes advice on how to exploit an SQL injection vulnerability and how to avoid becoming a victim of such an attack. |
Hack 41. Copy Web Pages into a Table
You can copy data from web pages into SQL using XSLT, which lets you pick and choose which
If the data in a web page is formatted consistently, you can write an XSLT stylesheet to convert it directly into SQL statements. You can target almost any kind of HTML web page, but it's easier if the source is well-
Figure 6-1. Highest-grossing films according to Wikipedia
After you
1> SELECT * FROM film ORDER BY 2 DESC 2> GO title gross --------------------------------------------------- -------------- Gone With the Wind 2699710936 Snow White and the Seven Dwarfs 2425862786 Titanic 2174317554 Star Wars Episode IV: A New Hope 1436811009 Jurassic Park 1202648438 Bambi 1191311757 The Lord of the Rings: The Return of the King 1175528250 Star Wars Episode I: The Phantom Menace 1054205059 6.1.1. XSLT Processing
The technique used here is based on
XSLT processing
. An XSLT stylesheet describes a transformation from an XML source to some other format. Often XSLT is used to translate one XML format into another XML format; here you will be using it to translate XML (
You will need an XSLT processor and an XSLT stylesheet for this process. If you are on a Windows system you can use
6.1.2. The Input DocumentTo extract data from an XHTML document you might need to use a little trial and error. You need to look at the raw HTML from the target page and identify the tag or tags that contain the data you are looking for. Look at the HTML from Wikipedia; this section is part of a much larger document (http://en.wikipedia.org/wiki/List_of_highest-grossing_films): <table class="wikitable"> <caption><b>List of highest-grossing films (adjusted)</b></caption> <tr> <th>Rank</th> <th>Movie name</th> <th>Worldwide Gross</th> </tr> <tr> <td>1</td> <td><i><a href="/wiki/Gone_with_the_Wind_%28film%29" title="Gone with the Wind (film)"> Gone With the Wind </a></i> (<a href="/wiki/1939" title="1939">1939</a>)</td> <td> ,699,710,936 </td> </tr> <tr> <td>2</td> <td><i><a href="/wiki/Snow_White_and_the_Seven_Dwarfs_%281937_film%29" title="Snow White and the Seven Dwarfs (1937 film)"> Snow White and the Seven Dwarfs </a></i> (<a href="/wiki/1937" title="1937">1937</a>)</td> <td> ,425,862,786 </td> </tr>
You must identify enough of the
Unfortunately, this document includes several other tables and tr elements, so you must be more discriminating. The one table that you are interested in is uniquely identified by its caption. You can refine your XPath expression to: htm:table [htm:caption='List of highest-grossing films (adjusted)' ] /htm:tr The square brackets introduce a condition on the match. You are still matching TR nodes, but now you are insisting that the parent of the TR is a table and that the table has a caption child with the specified content. This expression will still match too many tr elements. The first TR element of the table contains the headings and you do not want that to match. You can refine your XPath to ensure that only TR nodes that have TD children are matched:
htm:table[htm:caption='List of highest-grossing films (adjusted)'
]/htm:tr[
htm:td
]
Having identified the elements that represent each row of your table you can create a template to match them. The content of the template will be an SQL statement with xsl:value-of elements in place of actual data values. 6.1.3. gross.xslHere is the stylesheet. You can save the sheet as the file gross.xsl :
<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:htm="http://www.w3.org/1999/xhtml"
version="1.0"
>
<xsl:output omit-xml-declaration='yes'/>
<xsl:template
match="htm:table[
htm:caption='List of highest-grossing films (adjusted)'
]/htm:tr[htm:td]">
INSERT INTO film VALUES (
'<xsl:value-of select="htm:td[2]/htm:i"/>',
<xsl:value-of select="translate(htm:td[3],'$,','')"/>)
GO
</xsl:template>
<xsl:template match="text( )"/>
</xsl:stylesheet>
The file
gross.xsl
contains a single useful template. For every
tr
element in the document that matches the XPath expression, the template is activated. At each activation, the
INSERT
statement is generated and the
xsl:value-of
elements get
Notice the two
xsl:value-of
elements that return the title of the movie and the gross takings. The expression
htm:td[2]/htm:i
gives the text content of the
i
element contained in the second
td
element. This is the title of the movie. The gross figure is in the third
TD
element. The transformation
The square brackets may contain either a condition or a number. When a number is given, as in htm:td[2] , it is equivalent to htm:td[position( )=2] . Here is a sample of the output generated:
INSERT INTO film VALUES (
'Gone With the Wind',
2699710936)
GO
INSERT INTO film VALUES (
'Snow White and the Seven Dwarfs',
2425862786)
GO
This example is intended for SQL Server; that's why there's a GO command to mark the end of each batch. Most other systems require a semicolon in place of the word GO . 6.1.4. Running the Hack
The XSLT processor will take a page directly from the Web, and you can store the results in the file
gross.sql
before loading it into SQL Server. In this example, the stylesheet,
gross.xsl
, is in the current directory, but it can be in any directory or even in a remote URL. You can run the hack from a Windows command prompt as
C:> msxsl http://en.wikipedia.org/wiki/List_of_highest-grossing_films gross.xsl -o gross.sql C:> sqlcmd -E -S (local)\SQLExpress -d dbname 1> CREATE TABLE film (title VARCHAR(256), gross BIGINT) 2> GO 1> QUIT C:> sqlcmd -E -S (local)\SQLExpress -d dbname -i gross.sql (1 row affected) (1 row affected) (1 row affected) (1 row affected) (1 row affected) (1 row affected) (1 row affected) Msg 102, Level 15, State 1, Server PUMA\SQLEXPRESS, Line 3 Incorrect syntax near 's'. Msg 105, Level 15, State 1, Server PUMA\SQLEXPRESS, Line 3 Unclosed quotation mark after the character string ', 1058997333) '. (1 row affected) (1 row affected)
Notice that one of the films,
Harry Potter and the Sorcerer's Stone
, includes an apostrophe that has caused an error. We'll fix that in the
$
wget -O source.htm http://en.wikipedia.org/wiki/List_of_highest-grossing_
films
--23:17:49-- http://en.wikipedia.org/wiki/List_of_highest-grossing_films
=> \Qsource.htm'
Resolving en.wikipedia.org... 145.97.39.155
Connecting to en.wikipedia.org145.97.39.155:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 34,831 27.48K/s
23:17:50 (27.41 KB/s) - \Qsource.htm' saved [34831]
$
xsltproc -o gross.sql gross.xsl source.htm
$
mysql -u scott -ptiger dbname -e 'source gross.sql'
ERROR 1064 (42000) at line 30 in file: 'gross.sql': You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 's Stone',
1058997333);
INSERT INTO film VALUES (
'Star ' at line 2
The
The unbalanced quote in the title "The Sorcerer's Stone" is
6.1.4.1. Processing requiredYou can take care of quotes using the Perl one-liner on Windows ( type cleans up some of the messy Unicode you might get out of msxsl.exe ): type gross.sql perl -pe "s/'/''/g;s/''/'/;s/'',$/',/;" > grossQ.sql On Linux, Unix, or Mac OS X, the command is: perl -pe "s/'/''/g;s/''/'/;s/'',$/',/;" < gross.sql > grossQ.sql
This first substitution,
s/'/''/g
,
You should use your command-line SQL utility [Hack #1] to run the command DELETE FROM film before you load grossQ.sql or you will end up with some duplicates from your previous attempt. 6.1.5. Hacking the HackPages from Wikipedia are valid XHTML, and that makes them particularly suitable for this technique. If your source document is not well formed the XSLT process will halt with an error.
You can clean up many
You can use the Linux utility
The Internet Movie Database (IMDb) has a similar list of top-grossing films at http://www.imdb.com/boxoffice/alltimegross?region=world-wide, and this
You can download the page as before:
$
wget -O gross1.htm \
'http://www.imdb.com/boxoffice/alltimegross?region=world-wide'
--12:29:42-- http://www.imdb.com/boxoffice/alltimegross?region=world-wide
=> \Qgross1.htm'
Resolving www.imdb.com... done.
Connecting to www.imdb.com[207.171.166.140]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 80,936 233.15K/s
12:29:43 (233.15 KB/s) - \Qgross1.htm' saved [80936]
You can try to run this through Tidy, but when you do you get a bunch of errors and warnings. You can ignore the warnings but you must take care of the errors: $ tidy q -numeric -asxhtml gross1.htm line 2 column 1 - Warning: missing <!DOCTYPE> declaration line 23 column 1 - Warning: <layer> is not approved by W3C line 85 column 1 - Warning: missing </form> before </td> line 110 column 6 - Error: discarding unexpected </form> ... The error on line 110 is caused by an improperly nested tag. Tidy can cope with many common XML errors, but not that one. The quick and dirty solution to the unrecognized or badly nested tag is simply to get rid of it and hope that you don't lose any important data. grep will do that, and you can follow that up with a more successful run of Tidy: $ grep -v '</form>' gross1.htm grep -v 'imdb:roundend' > gross2.htm $ tidy -o gross3.htm -q -asxhtml -numeric gross2.htm line 2 column 1 - Warning: missing <!DOCTYPE> declaration line 23 column 1 - Warning: <layer> is not approved by W3C line 85 column 1 - Warning: missing <td> line 86 column 1 - Warning: discarding unexpected <td> line 85 column 1 - Warning: missing </form> before </table> ... The Tidy program now gives even more warnings than before, but no errors, and gross3.htm now contains valid XML that you can process with XSLT. Instead of examining the file by hand, you can get xmllint to find the structures you need; see Hack #35, "Explore a Document Tree with the xmllint Shell," in XML Hacks by Michael Fitzgerald (O'Reilly), for more information on this: $ xmllint --shell gross3.htm / > grep Potter /html/body/div[2]/layer/table[3]/tr/td/table[2]/tr/td[2]/table/tr/td[1] /table[1]/tr [4]/td[2]/a : t-- 37 Harry Potter and the Sorcerer's Stone /html/body/div[2]/layer/table[3]/tr/td/table[2]/tr/td[2]/table/tr/td[1] /table[1]/tr [8]/td[2]/a : t-- 35 Harry Potter and the Goblet of Fire /html/body/div[2]/layer/table[3]/tr/td/table[2]/tr/td[2]/table/tr/td[1] /table[1]/tr [10]/td[2]/a : t-- 39 Harry Potter and the Chamber of Secrets /html/body/div[2]/layer/table[3]/tr/td/table[2]/tr/td[2]/table/tr/td[1] /table[1]/tr [18]/td[2]/a : t-- 40 Harry Potter and the Prisoner of Azkaban... / > grep 968,657,891 /html/body/div[2]/layer/table[3]/tr/td/table[2]/tr/td[2]/table/tr/td[1] /table[1]/tr [4]/td[3] : t-- 12 8,657,891 You use the grep command to identify strings that you know to be in the document. In the preceding sequence, the string Potter crops up in several movie titles and the figure 968,657,891 appears as the box office takings for the first Potter film. You can see that there are "Potter" titles in tr[4]/td[2]/a and tr[8]/td[2]/a , and in two other places. Each is in a nest of tables four deep. The figure 968,657,891 appears in the gross value for the first Potter film. That value crops up in TR[4]/td[3] and it is the tr[4] node that contains the film title. You can use this information to create your template in the XSL sheet. This sheet has much in common with the previous one. This time we've not tried to be smart about the XPath expression, so this sheet will be more fragile with respect to trivial formatting changes made by IMDb. The match expression in the template is simply the pattern given by xmllint , but with the namespace htm: prepended to each element name. Save this in a file named imdb.xsl :
<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:htm="http://www.w3.org/1999/xhtml"
version="1.0"
>
<xsl:output omit-xml-declaration='yes'/>
<xsl:template match="/htm:html/htm:body/htm:div[2]/htm:layer/htm:table[3]/
htm:tr/htm:td/htm:table[2]/htm:tr/htm:td[2]/htm:table/
htm:tr/htm:td[1]/htm:table[1]/htm:tr[position( )>1]">
INSERT INTO film VALUES (
'<xsl:value-of select="normalize-space(htm:td[2]/htm:a)"/>',
<xsl:value-of select='translate(htm:td[3],"$,","")'/>
);
</xsl:template>
<xsl:template match="text( )"/>
</xsl:stylesheet>
The translate function is doing precisely the same job as before, stripping out the dollar signs and the commas. There are line-end characters in the titles this time and the normalize-space function takes care of those. You need to use the [position( )>1] condition to skip the first tr in the list returned; the first row holds the headers. You can run the sheet using xsltproc : $ xsltproc imdb.xsl gross3.htm > gross4.sql You will have the same problem with single quotes in movie titles, and you can solve it the same way: $ perl -pe "s/'/''/g;s/''/'/;s/'',/',/" < gross4.sql > gross5.sql Finally, you can put your data into MySQL and look at the results: $ mysql u scott ptiger scott < gross5.sql $ mysql -u scott -ptiger scott -e 'select * from film' +----------------------------------------------------+------------+ title gross +----------------------------------------------------+------------+ Titanic 1835300000 The Lord of the Rings: The Return of the King 1129219252 Harry Potter and the Sorcerer's Stone 968657891 ... |