Deal with double quotation marks in delimited files.
Importing data from a delimited text file into an application is usually painless. Even if you need to change the delimiter from one character to another (from a comma to a colon, for example), you can choose from many tools that perform simple character substitution with great ease.
However, one common situation is not solved as easily: many business applications export data into a space- or comma-delimited file, enclosing individual fields in double quotation marks. These fields often contain the delimiter character. Importing such a file into an application that processes only one delimiter (PostgreSQL for example) may result in an incorrect interpretation of the data. This is one of those situations where the user should feel lucky if the process fails.
One solution is to write a script that tracks the use of double quotes to determine whether it is working within a text field. This is doable by creating a variable that acts as a text/nontext switch for the character substitution process. The script should change the delimiter to a more appropriate character, leave the delimiters that were enclosed in double quotes unchanged, and remove the double quotes. Rather than make the changes to the original datafile, it's safer to write the edited data to a new file.
2.6.1 Attacking the Problem
The following algorithm meets our needs:
Now, for every character in the datafile:
2.6.2 The Code
The Python script redelim.py implements the preceding algorithm. It prompts the user for the original datafile and a name for the new datafile. The delim and new_delim variables are hardcoded, but those are easily changed within the script.
This script copies a space-delimited text file with text values in double quotes to a new, tab-delimited file without the double quotes. The advantage of using this script is that it leaves spaces that were within double quotes unchanged.
There are no command-line arguments for this script. The script will prompt the user for source and destination file information.
You can redefine the variables for the original and new delimiters, delim and new_delim, in the script as needed.
#!/usr/local/bin/python import os print """ Change text file delimiters. # Ask user for source and target files. sourcefile = raw_input('Please enter the path and name of the source file:') targetfile = raw_input('Please enter the path and name of the target file:') # Open files for reading and writing. source = open(sourcefile,'r') dest = open(targetfile,'w') # The variable 'm' acts as a text/non-text switch that reminds python # whether it is working within a text or non-text data field. tswitch = 1 # If the source delimiter that you want to change is not a space, # redefine the variable delim in the next line. delim = ' ' # If the new delimiter that you want to change is not a tab, # redefine the variable new_delim in the next line. new_delim = '\t' for charn in source.read( ): if tswitch = = 1: if charn = = delim: dest.write(new_delim) elif charn = = '\"': tswitch = tswitch * -1 else: dest.write(charn) elif tswitch = = -1: if charn = = '\"': tswitch = tswitch * -1 else: dest.write(charn) source.close( ) dest.close( )
Use of redelim.py assumes that you have installed Python, which is available through the ports collection or as a binary package. The Python module used in this code is installed by default.
2.6.3 Hacking the Hack
If you prefer working with Perl, DBD::AnyData is another good solution to this problem.
2.6.4 See Also