3.11. Importing Data in Bulk


While the INSERT and REPLACE statements are useful, they can be time-consuming when you're entering a large number of rows, because they're somewhat manual methods of entering data. Often, when setting up a new database, you will need to migrate data from an old database to MySQL. In the case of our bookstore, let's suppose that a vendor has sent us a disk with a list of all of their books in a simple text file. Each record for each book is on a separate line and each field of each record is separated by a vertical bar. Here's how the fictitious vendor's data text file looks:

ISBN|TITLE|AUTHOR LAST|AUTHOR FIRST|COPYRIGHT DATE| 067973452X|Notes from Underground|Dostoevsky|Fyodor|August 1994| ...

Obviously, an actual vendor file would contain more fields and records than are shown here, but this is enough for our example. The first line contains descriptions of the fields in the records that follow. We don't need to extract the first line; it's just instructions for us. So, we'll tell MySQL to ignore it when we enter our SQL statement. As for the data, we must consider a few problems: the fields are not in the order that they are found in our tables. We'll have to tell MySQL the order in which the data will be coming so that it can make adjustments. The other problem is that this text table contains data for our books table and our authors table. This is going to be a bit tricky, but we can deal with it. What we'll do is extract the author information only in one SQL statement, then we'll run a separate SQL statement to import the book information. To start, we will copy the vendor's file called books.txt to the /tmp directory, and then we will run a LOAD DATA INFILE statement from the mysql client:

LOAD DATA INFILE '/tmp/books.txt' REPLACE INTO TABLE authors FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n' TEXT_FIELDS(col1, col2, col3, col4, col5) SET author_last = col3, author_first = col4 IGNORE col1, col2, col5, 1 LINES;

First, I should point out that the TEXT_FIELDS and the IGNORE clause for columns are not available before Version 4.1 of MySQL. The IGNORE n LINES clause has been around for awhile, though. With IGNORE 1 LINES, the first line of the text file containing the column headings will be ignored. Going back to the first line in the SQL statement here, we've named the file to load and the table in which to load the data. The REPLACE flag has the effect of the REPLACE statement mentioned earlier.

In the second line, we specify that fields are terminated by a vertical bar and that lines are terminated by a carriage return (\r) and a newline (\n) to terminate each line. This is the format for an MS-DOS text file. Unix files have only a newline feed. In the third line, we create aliases for each column. In the fourth line, we name the table columns to receive data and set their values based on the aliases given in the previous line. In the final line, we tell MySQL to ignore the columns that we don't want, as well as the top line, because it doesn't contain data.

If you're using an older version of MySQL that doesn't have this new feature of being able to ignore unwanted columns, you will have to perform a couple of extra steps. There are a few different ways of doing this. One simple way, if the table we're loading data into isn't too large, is to add three extra, temporary columns to authors that will take in the unwanted fields of data from the text file and drop them later. This would look like the following:

ALTER TABLE authors ADD COLUMN col1 VARCHAR(50), ADD COLUMN col2 VARCHAR(50), ADD COLUMN col5 VARCHAR(50);     LOAD DATA INFILE '/tmp/books.txt' REPLACE INTO TABLE authors FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (col1, col2, author_last, author_first, col5);     ALTER TABLE authors DROP COLUMN col1, DROP COLUMN col2, DROP COLUMN col5;

These statements will work, but they're not as graceful as the more straightforward statement shown earlier. In the second SQL statement here, notice that the IGNORE clause specifies one line to be ignored. The last line of the same statement lists the columns in the authors table that are to receive the data and the sequence in which they will be imported. In the third SQL statement, having finished importing the data from the vendor's text file, we now delete the temporary columns with their unnecessary data by using a DROP statement. There's usually no recourse from DROP, no undo. So, take care in using it.

Once we manage to copy the list of authors into the authors table from the text file, we need to load the data on the books and find the correct author_id for each book. We do this through the following:

LOAD DATA INFILE '/tmp/books.txt' IGNORE INTO TABLE books FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n' TEXT_FIELDS(col1, col2, col3, col4, col5) SET isbn = col1, title = col2,      pub_year = RIGHT(col5, 4),     author_id =       (SELECT authors.rec_id         WHERE author_last = col3         AND author_first = col4)  IGNORE col3, col4, 1 LINES;

In this SQL statement, we've added a couple of twists to get what we need. On the fifth line, to extract the year from the copyright field, which contains both the month and the year, we're using the string function RIGHT( ). It captures the last four characters of col5 as specified in the second argument. Starting on the sixth line, we're using a subquery to determine the author_id based on data from the authors table where the author's last and first names match what is found in the respective aliases. The results of what is selected within the parentheses will be written to the author_id column. Finally, we're having MySQL ignore col3 and col4, as well as the column heading line. Doing this maneuver with earlier versions of MySQL will require temporary columns or a temporary table along the lines of the previous example. The IGNORE flag on the first line, incidentally, instructs MySQL to ignore error messages, not to replace any duplicate rows, and to continue executing the SQL statement.



MySQL in a Nutshell
MYSQL in a Nutshell (In a Nutshell (OReilly))
ISBN: 0596514336
EAN: 2147483647
Year: 2006
Pages: 134
Authors: Russell Dyer

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net