Recipe 10.36. Guessing Table Structure from a Datafile


Problem

Someone gives you a datafile and says, "Here, put this into MySQL for me." But no table yet exists to hold the data.

Solution

Write the CREATE TABLE statement yourself. Or use a utility that guesses the table structure by examining the contents of the datafile.

Discussion

Sometimes you need to import data into MySQL for which no table has yet been set up. You can create the table yourself, based on any knowledge you might have about the contents of the file. Or you might be able to avoid some of the work by using guess_table.pl, a utility located in the transfer directory of the recipes distribution. guess_table.pl reads the datafile to see what kind of information it contains, and then attempts to produce an appropriate CREATE TABLE statement that matches the contents of the file. This script is necessarily imperfect, because column contents sometimes are ambiguous. (For example, a column containing a small number of distinct strings might be a VARCHAR column or an ENUM.) Still, it's often easier to tweak the CREATE TABLE statement that guess_table.pl produces than to write the entire statement from scratch. This utility also has a diagnostic function, although that's not its primary purpose. For example, you might believe a column contains only numbers, but if guess_table.pl indicates that it should be created using a VARCHAR type, that tells you the column contains at least one nonnumeric value.

guess_table.pl assumes that its input is in tab-delimited, linefeed-terminated format. It also assumes valid input because any attempt to guess data types based on possibly flawed data is doomed to failure. This means, for example, that if a date column is to be recognized as such, it should be in ISO format. Otherwise, guess_table.pl may characterize it as a VARCHAR column. If a datafile doesn't satisfy these assumptions, you may be able to reformat it first using the cvt_file.pl and cvt_date.pl utilities described in Recipes Section 10.18 and Section 10.31.

guess_table.pl understands the following options:


--labels

Interpret the first input line as a row of column labels, and use them for table column names. If this option is omitted, guess_table.pl uses default column names of c1, c2, and so forth.

Note that if the file contains a row of labels, and you neglect to specify this option, the labels will be treated as data values by guess_table.pl. The likely result is that the script will characterize all columns as VARCHAR columns (even those that otherwise contain only numeric or temporal values), due to the presence of a nonnumeric or nontemporal value in the column.


--lower, --upper

Force column names in the CREATE TABLE statement to be lowercase or uppercase.


--quote-names, --skip-quote-names

Quote or do not quote table and column identifiers in the CREATE TABLE statement with ` characters (for example, `mytbl`). This can be useful if an identifier is a reserved word. The default is to quote identifiers.


--report

Generate a report rather than a CREATE TABLE statement. The script displays the information that it gathers about each column.


--table = tbl_name

Specify the table name to use in the CREATE TABLE statement. The default name is t.

Here's an example of how guess_table.pl works. Suppose that a file named stockdat.csv is in CSV format and has the following contents:

commodity,trade_date,shares,price,change sugar,12-14-2006,1000000,10.50,-.125 oil,12-14-2006,96000,60.25,.25 wheat,12-14-2006,2500000,8.75,0 gold,12-14-2006,13000,103.25,2.25 sugar,12-15-2006,970000,10.60,.1 oil,12-15-2006,105000,60.5,.25 wheat,12-15-2006,2370000,8.65,-.1 gold,12-15-2006,11000,101,-2.25 

The first row indicates the column labels, and the following rows contain data records, one per line. The values in the TRade_date column are dates, but they are in MM-DD-CCYY format rather than the ISO format that MySQL expects. cvt_date.pl can convert these dates to ISO format. However, both cvt_date.pl and guess_table.pl require input in tab-delimited, linefeed-terminated format. So first let's use cvt_file.pl to convert the input to tab-delimited, linefeed-terminated format. Then we can convert the dates with cvt_date.pl:

% cvt_file.pl --iformat=csv stockdat.csv > tmp1.txt % cvt_date.pl --iformat=us tmp1.txt > tmp2.txt             

Then feed the resulting file, tmp2.txt, to guess_table.pl:

% guess_table.pl --labels --table=stocks tmp2.txt > stocks.sql             

The CREATE TABLE statement that guess_table.pl writes to stocks.sql looks like this:

CREATE TABLE `stocks` (   `commodity` VARCHAR(5) NOT NULL,   `trade_date` DATE NOT NULL,   `shares` INT UNSIGNED NOT NULL,   `price` DOUBLE UNSIGNED NOT NULL,   `change` DOUBLE NOT NULL ); 

guess_table.pl produces that statement based on deductions such as the following:

  • If a column contains only numeric values, it's assumed to be an INT if no values contain a decimal point, and DOUBLE otherwise.

  • If a numeric column contains no negative values, the column is likely to be UNSIGNED.

  • If a column contains no empty values, guess_table.pl assumes that it's probably NOT NULL.

  • Columns that cannot be classified as numbers or dates are taken to be VARCHAR columns, with a length equal to the longest value present in the column.

You might want to edit the CREATE TABLE statement that guess_table.pl produces, to make modifications such as increasing the size of character fields, changing VARCHAR to CHAR, or adding indexes. Another reason to edit the statement is that if a column has a name that is a reserved word in MySQL, you can rename it.

To create the table, use the statement produced by guess_table.pl:

% mysql cookbook < stocks.sql             

Then you can load the datafile into the table (skipping the initial row of labels):

mysql> LOAD DATA LOCAL INFILE 'tmp2.txt' INTO TABLE stocks     -> IGNORE 1 LINES;             

The resulting data after import looks like this:

mysql> SELECT * FROM stocks; +-----------+------------+---------+--------+--------+ | commodity | trade_date | shares  | price  | change | +-----------+------------+---------+--------+--------+ | sugar     | 2006-12-14 | 1000000 |   10.5 | -0.125 | | oil       | 2006-12-14 |   96000 |  60.25 |   0.25 | | wheat     | 2006-12-14 | 2500000 |   8.75 |      0 | | gold      | 2006-12-14 |   13000 | 103.25 |   2.25 | | sugar     | 2006-12-15 |  970000 |   10.6 |    0.1 | | oil       | 2006-12-15 |  105000 |   60.5 |   0.25 | | wheat     | 2006-12-15 | 2370000 |   8.65 |   -0.1 | | gold      | 2006-12-15 |   11000 |    101 |  -2.25 | +-----------+------------+---------+--------+--------+ 




MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2004
Pages: 375
Authors: Paul DuBois

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net