Hack 17 Delimiter Dilemma


figs/expert.gif figs/hack17.gif

Deal with double quotation marks in delimited files.

Importing data from a delimited text file into an application is usually painless. Even if you need to change the delimiter from one character to another (from a comma to a colon, for example), you can choose from many tools that perform simple character substitution with great ease.

However, one common situation is not solved as easily: many business applications export data into a space- or comma-delimited file, enclosing individual fields in double quotation marks. These fields often contain the delimiter character. Importing such a file into an application that processes only one delimiter (PostgreSQL for example) may result in an incorrect interpretation of the data. This is one of those situations where the user should feel lucky if the process fails.

One solution is to write a script that tracks the use of double quotes to determine whether it is working within a text field. This is doable by creating a variable that acts as a text/nontext switch for the character substitution process. The script should change the delimiter to a more appropriate character, leave the delimiters that were enclosed in double quotes unchanged, and remove the double quotes. Rather than make the changes to the original datafile, it's safer to write the edited data to a new file.

2.6.1 Attacking the Problem

The following algorithm meets our needs:

  1. Create the switch variable and assign it the value of 1, meaning "nontext". We'll declare the variable tswitch and define it as tswitch = 1.

  2. Create a variable for the delimiter and define it. We'll use the variable delim with a space as the delimiter, so delim = ' '.

  3. Decide on a better delimiter. We'll use the tab character, so new_delim = '\t'.

  4. Open the datafile for reading.

  5. Open a new file for writing.

Now, for every character in the datafile:

  1. Read a character from the datafile.

  2. If the character is a double quotation mark, tswitch = tswitch * -1.

  3. If the character equals the character in delim and tswitch equals 1, write new_delim to the new file.

  4. If the character equals that in delim and tswitch equals -1, write the value of delim to the new file.

  5. If the character is anything else, write the character to the new file.

2.6.2 The Code

The Python script redelim.py implements the preceding algorithm. It prompts the user for the original datafile and a name for the new datafile. The delim and new_delim variables are hardcoded, but those are easily changed within the script.

This script copies a space-delimited text file with text values in double quotes to a new, tab-delimited file without the double quotes. The advantage of using this script is that it leaves spaces that were within double quotes unchanged.

There are no command-line arguments for this script. The script will prompt the user for source and destination file information.

You can redefine the variables for the original and new delimiters, delim and new_delim, in the script as needed.

#!/usr/local/bin/python import os print """ Change text file delimiters. # Ask user for source and target files. sourcefile = raw_input('Please enter the path and name of the source file:') targetfile = raw_input('Please enter the path and name of the target file:') # Open files for reading and writing. source = open(sourcefile,'r') dest   = open(targetfile,'w') # The variable 'm' acts as a text/non-text switch that reminds python # whether it is working within a text or non-text data field. tswitch = 1 # If the source delimiter that you want to change is not a space, # redefine the variable delim in the next line. delim = ' ' # If the new delimiter that you want to change is not a tab, # redefine the variable new_delim in the next line. new_delim = '\t' for charn in source.read( ):         if tswitch =  = 1:               if charn =  = delim:                        dest.write(new_delim)               elif charn =  = '\"':                        tswitch = tswitch * -1               else:                        dest.write(charn)      elif tswitch =  = -1:               if charn =  = '\"':                       tswitch = tswitch * -1               else:                       dest.write(charn) source.close( ) dest.close( )

Use of redelim.py assumes that you have installed Python, which is available through the ports collection or as a binary package. The Python module used in this code is installed by default.

2.6.3 Hacking the Hack

If you prefer working with Perl, DBD::AnyData is another good solution to this problem.

2.6.4 See Also

  • The Python home page (http://www.python.org/)



BSD Hacks
BSD Hacks
ISBN: 0596006799
EAN: 2147483647
Year: 2006
Pages: 160
Authors: Lavigne

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net