When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM (view CD-ROM content online at http://examples.oreilly.com/python2) -- one with Unix line-end markers, and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive, and ignore the other one.
If you read Chapter 2, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage-return, line-feed), but Unix uses just a single \n. Most modern text editors don't care -- they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see odd characters when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).
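To make the difference concrete, here is a small illustrative sketch. It assumes Python 3 and bytes strings rather than the Python 2 syntax used in this chapter's listings, and the variable names are made up for the illustration. It shows the same three-line text in both formats: the DOS version carries one extra \r byte per line, yet a line splitter that understands both conventions sees identical lines.

```python
# Illustrative only: the same three-line text in the two formats discussed.
dos_text = b"a\r\nb\r\nc\r\n"      # DOS/Windows: carriage-return + line-feed
unix_text = b"a\nb\nc\n"           # Unix: line-feed only

# Each DOS line carries one extra byte (the \r), so the sizes differ.
size_difference = len(dos_text) - len(unix_text)

# bytes.splitlines() accepts both conventions, much like a forgiving editor.
same_lines = dos_text.splitlines() == unix_text.splitlines()

print(size_difference, same_lines)
```

Tools that look only for \n, or only for \r\n, are the ones that show stray characters or run every line together.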
Because this is only an occasional annoyance, and because it's easy to forget to keep two distinct example trees in sync, I adopted a different policy for this second edition: we're shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.
The main obstacle, of course, is how to go about providing a portable and easy to use converter -- one that runs "out of the box" on almost every computer, without changes or recompiles. Some Unix platforms have commands like fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms.
Fortunately, Python does. The scripts presented in Example 5-1, Example 5-3, and Example 5-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files, respectively. In this section, we briefly look at each of the three scripts, and contrast some of the system tools they apply. Each reuses the prior script's code, and becomes progressively more powerful in the process.
The last of these three scripts, Example 5-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.
5.2.1 Converting Line Ends in One File
These three scripts were developed in stages on purpose, so I could first focus on getting line-feed conversions right, before worrying about directories and tree walking logic. With that scheme in mind, Example 5-1 addresses just the task of converting lines in a single text file.
Example 5-1. PP2E\PyTools\fixeoln_one.py
###################################################################
# Use: "python fixeoln_one.py [tounix|todos] filename".
# Convert end-of-lines in the single text file whose name is passed
# in on the command line, to the target format (tounix or todos).
# The _one, _dir, and _all converters reuse the convert function
# here.  convertEndlines changes end-lines only if necessary:
# lines that are already in the target format are left unchanged,
# so it's okay to convert a file > once with any of the 3 fixeoln
# scripts.  Notes: must use binary file open modes for this to
# work on Windows, else default text mode automatically deletes
# the \r on reads, and adds an extra \r for each \n on writes;
# Mac format not supported; PyTools\dumpfile.py shows raw bytes;
###################################################################

import os
listonly = 0                     # 1=show file to be changed, don't rewrite

def convertEndlines(format, fname):                      # convert one file
    if not os.path.isfile(fname):                        # todos:  \n   => \r\n
        print 'Not a text file', fname                   # tounix: \r\n => \n
        return                                           # skip directory names

    newlines = []
    changed  = 0
    for line in open(fname, 'rb').readlines():           # use binary i/o modes
        if format == 'todos':                            # else \r lost on Win
            if line[-1:] == '\n' and line[-2:-1] != '\r':
                line = line[:-1] + '\r\n'
                changed = 1
        elif format == 'tounix':                         # avoids IndexError
            if line[-2:] == '\r\n':                      # slices are scaled
                line = line[:-2] + '\n'
                changed = 1
        newlines.append(line)

    if changed:
        try:                                             # might be read-only
            print 'Changing', fname
            if not listonly: open(fname, 'wb').writelines(newlines)
        except IOError, why:
            print 'Error writing to file %s: skipped (%s)' % (fname, why)

if __name__ == '__main__':
    import sys
    errmsg = 'Required arguments missing: ["todos"|"tounix"] filename'
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    convertEndlines(sys.argv[1], sys.argv[2])
    print 'Converted', sys.argv[2]
This script is fairly straightforward as system utilities go; it relies primarily on the built-in file object's methods. Given a target format flag and filename, it loads the file into a lines list using the readlines method, converts input lines to the target format if needed, and writes the result back to the file with the writelines method if any lines were changed:
C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP2ndEd\examples\PP2E\PyDemos.pyw
FC: no differences encountered

C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Converted PyDemos.pyw

C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
Traceback (innermost last):
  File "C:\PP2ndEd\examples\PP2E\PyTools\fixeoln_one.py", line 45, in ?
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
AssertionError: Required arguments missing: ["todos"|"tounix"] filename
Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention -- all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file. Notice that this script's filename has a "_" in it, not a "-"; because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won't work for both roles).
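The per-line test at the heart of this converter can also be tried out in memory, without touching any files. The following is only a sketch under stated assumptions: it uses Python 3 syntax and bytes strings rather than the Python 2 code of Example 5-1, and the names convert_line and convert_bytes are hypothetical helpers invented for this illustration, not part of the book's examples package. It reads lines the way a binary-mode readlines call does (splitting on \n only) and reports whether anything changed, so converting twice is harmless.

```python
import io

def convert_line(line, target):
    """Return (new_line, changed) for one line read in binary mode."""
    if target == "todos":
        # Add \r only when the line ends in a bare \n; slices avoid IndexError
        # on empty or one-byte lines.
        if line[-1:] == b"\n" and line[-2:-1] != b"\r":
            return line[:-1] + b"\r\n", True
    elif target == "tounix":
        # Drop the \r only when the line ends in the full \r\n pair.
        if line[-2:] == b"\r\n":
            return line[:-2] + b"\n", True
    return line, False

def convert_bytes(data, target):
    """Convert a whole file's bytes; return (new_data, changed)."""
    changed = False
    converted = []
    # BytesIO.readlines splits on \n only, like a file opened in 'rb' mode.
    for line in io.BytesIO(data).readlines():
        new_line, line_changed = convert_line(line, target)
        changed = changed or line_changed
        converted.append(new_line)
    return b"".join(converted), changed

mixed = b"a\nb\r\nc"                                  # mixed endings, no final newline
as_dos, changed_first = convert_bytes(mixed, "todos")
as_dos_again, changed_second = convert_bytes(as_dos, "todos")
as_unix, changed_third = convert_bytes(as_dos, "tounix")
print(repr(as_dos), changed_first, changed_second, repr(as_unix))
```

The second todos pass reports no change, which mirrors the fourth command in the session above: the script prints only its Converted message and does not rewrite the file.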
5.2.1.1 Slinging bytes and verifying results
The fc DOS file-compare command in the preceding interaction confirms the conversions, but to better verify the results of this Python script, I wrote another, shown in Example 5-2.
Example 5-2. PP2E\PyTools\dumpfile.py
import sys
bytes = open(sys.argv[1], 'rb').read()

print '-'*40
print repr(bytes)

print '-'*40
while bytes:
    bytes, chunk = bytes[4:], bytes[:4]          # show 4-bytes per line
    for c in chunk: print oct(ord(c)), '\t',     # show octal of binary value
    print

print '-'*40
for line in open(sys.argv[1], 'rb').readlines():
    print repr(line)
To give a clear picture of a file's contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of its contents four bytes per line, and shows its raw lines. Let's use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:
C:\temp>type test.txt
a
b
c

C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\015\012b\015\012c\015\012'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\015\012'
'b\015\012'
'c\015\012'
The test.txt file here is in DOS line-end format -- the escape sequence