7.2. Fixing DOS Line Ends

When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM: one with Unix line-end markers and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive and ignore the other one.

If you read Chapter 4, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage return, line feed), but Unix uses just a single \n. Most modern text editors don't care; they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see the odd \r character when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).

Because this is only an occasional annoyance, and because it's easy to forget to keep two distinct example trees in sync, I adopted a different policy as of the book's second edition: we're shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.

The main obstacle, of course, is how to go about providing a portable and easy-to-use converter, one that runs "out of the box" on almost every computer, without changes or recompiles. Some Unix platforms have commands such as fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms.

Fortunately, Python does. The scripts presented in Examples 7-1, 7-3, and 7-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files.
In this section, we briefly look at each script and contrast some of the system tools they apply. Each reuses the prior script's code and becomes progressively more powerful in the process. The last of these three scripts, Example 7-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.

7.2.1. Converting Line Ends in One File

These three scripts were developed in stages on purpose, so that I could focus on getting line-feed conversions right before worrying about directories and tree walking logic. With that scheme in mind, Example 7-1 addresses just the task of converting lines in a single text file.

Example 7-1. PP3E\PyTools\fixeoln_one.py
This script is fairly straightforward as system utilities go; it relies primarily on the built-in file object's methods. Given a target format flag and filename, it loads the file into a lines list using the readlines method, converts input lines to the target format if needed, and writes the result back to the file with the writelines method if any lines were changed:

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
    Changing PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
    Changing PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
    Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
    FC: no differences encountered

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
    Converted PyDemos.pyw

    C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
    Traceback (innermost last):
      File "C:\PP3rdEd\examples\PP3E\PyTools\fixeoln_one.py", line 45, in ?
        assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    AssertionError: Required arguments missing: ["todos"|"tounix"] filename

Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention, all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file.

Notice that this script's filename has an _ (underscore) in it, not a - (hyphen); because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won't work for both roles).
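The load, convert, and write-back logic just described can be sketched as follows. This is a minimal illustration, assuming Python 3 (where binary files yield bytes), and is not the book's listing; the names convert_line and convert_file are hypothetical.

```python
import os

def convert_line(mode, line):
    """Return line (bytes) with its end-of-line marker in the target format."""
    if mode == 'tounix' and line[-2:] == b'\r\n':
        return line[:-2] + b'\n'              # drop the \r before the \n
    if mode == 'todos' and line[-1:] == b'\n' and line[-2:-1] != b'\r':
        return line[:-1] + b'\r\n'            # add the missing \r
    return line                               # already in the target format

def convert_file(mode, path):
    """Convert path in place; return True only if the file was rewritten."""
    if not os.path.isfile(path):
        return False                          # skip directories and missing names
    with open(path, 'rb') as f:               # binary mode: no automatic translation
        lines = f.readlines()
    fixed = [convert_line(mode, line) for line in lines]
    if fixed == lines:
        return False                          # nothing to change, leave the file alone
    print('Changing', path)
    with open(path, 'wb') as f:
        f.writelines(fixed)
    return True
```

Calling convert_file('tounix', 'PyDemos.pyw') twice in a row rewrites the file at most once; the second call finds every line already in the target format and returns False without touching the file.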
7.2.1.1. Slinging bytes and verifying results

The fc DOS file-compare command in the preceding interaction confirms the conversions, but to better verify the results of this Python script, I wrote another, shown in Example 7-2.

Example 7-2. PP3E\PyTools\dumpfile.py
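A byte-dumping utility of this kind might look like the following minimal sketch. It assumes Python 3 and is not the book's dumpfile.py; the function names octal_rows and dump are hypothetical. It opens the file in binary mode, prints the raw bytes, prints octal codes four per row, and then prints each raw line.

```python
def octal_rows(data, per_row=4):
    """Format each byte as a 0-prefixed octal code, per_row codes to a row."""
    codes = ['0%o' % byte for byte in data]   # iterating bytes yields ints in Python 3
    return [' '.join(codes[i:i + per_row])
            for i in range(0, len(codes), per_row)]

def dump(path):
    """Show a file's raw bytes, their octal codes, and its raw lines."""
    sep = '-' * 40
    with open(path, 'rb') as f:               # binary mode: see bytes as stored
        data = f.read()
    print(sep)
    print(data)                               # whole file as one bytes literal
    print(sep)
    for row in octal_rows(data):
        print(row)
    print(sep)
    for line in data.splitlines(keepends=True):
        print(line)                           # line-end bytes stay visible
```

Because everything is read as bytes, any \r characters stored in the file show up in the output instead of being translated away.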
To give a clear picture of a file's contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of its contents four bytes per line, and shows its raw lines. Let's use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:

    C:\temp>type test.txt
    a
    b
    c

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\r\nb\r\nc\r\n'
    ----------------------------------------
    0141 015 012 0142
    015 012 0143 015
    012
    ----------------------------------------
    'a\r\n'
    'b\r\n'
    'c\r\n'

The test.txt file here is in DOS line-end format; the escape sequence \r\n is simply the DOS line-end marker. Now, converting to Unix format changes all the DOS \r\n markers to a single \n as advertised:

    C:\temp>python %X%\PyTools\fixeoln_one.py tounix test.txt
    Changing test.txt
    Converted test.txt

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\nb\nc\n'
    ----------------------------------------
    0141 012 0142 012
    0143 012
    ----------------------------------------
    'a\n'
    'b\n'
    'c\n'

And converting back to DOS restores the original file format:

    C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt
    Changing test.txt
    Converted test.txt

    C:\temp>python %X%\PyTools\dumpfile.py test.txt
    ----------------------------------------
    'a\r\nb\r\nc\r\n'
    ----------------------------------------
    0141 015 012 0142
    015 012 0143 015
    012
    ----------------------------------------
    'a\r\n'
    'b\r\n'
    'c\r\n'

    C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt     # makes no changes
    Converted test.txt

7.2.1.2. Nonintrusive conversions

Notice that no "Changing" message is emitted for the last command just run because no changes were actually made to the file (it was already in DOS format).
Because this program is smart enough to avoid converting a line that is already in the target format, it is safe to rerun on a file even if you can't recall what format the file already uses. More naïve conversion logic might be simpler, but it may not be repeatable. For instance, a replace string method call can be used to expand a Unix \n to a DOS \r\n, but only once:

    >>> lines = 'aaa\nbbb\nccc\n'
    >>> lines = lines.replace('\n', '\r\n')          # OK: \r added
    >>> lines
    'aaa\r\nbbb\r\nccc\r\n'
    >>> lines = lines.replace('\n', '\r\n')          # bad: double \r
    >>> lines
    'aaa\r\r\nbbb\r\r\nccc\r\r\n'

Such logic could easily trash a file if applied to it twice.[*] To really understand how the script gets around this problem, though, we need to take a closer look at its use of slices and binary file modes.
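For comparison, a whole-string replace can be made repeatable by collapsing any existing \r\n to \n before expanding. This is an alternative sketch, not the per-line slicing technique the script uses; the function names to_unix and to_dos are hypothetical.

```python
def to_unix(text):
    """Collapse every DOS \\r\\n marker to a single \\n."""
    return text.replace('\r\n', '\n')

def to_dos(text):
    """Normalize to \\n first, then expand every \\n to \\r\\n."""
    return to_unix(text).replace('\n', '\r\n')

once = to_dos('aaa\nbbb\nccc\n')
twice = to_dos(once)                          # second pass adds no extra \r
print(repr(once))
print(once == twice)
```

The normalizing step is what makes the second call harmless: by the time \n is expanded, no \r\n pairs remain to be doubled.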
7.2.1.3. Slicing strings out of bounds

This script relies on subtle aspects of string slicing behavior to inspect parts of each line without size checks. For instance:
Because out-of-bounds slices scale slice limits to be in bounds, the script doesn't need to add explicit tests to guarantee that the line is big enough to have end-line characters at the end. For example:

    >>> 'aaaXY'[-2:], 'XY'[-2:], 'Y'[-2:], ''[-2:]
    ('XY', 'XY', 'Y', '')
    >>> 'aaaXY'[-2:-1], 'XY'[-2:-1], 'Y'[-2:-1], ''[-2:-1]
    ('X', 'X', '', '')
    >>> 'aaaXY'[:-2], 'aaaY'[:-1], 'XY'[:-2], 'Y'[:-1]
    ('aaa', 'aaa', '', '')

If you imagine characters such as \r and \n rather than the X and Y here, you'll understand how the script exploits slice scaling to good effect.

7.2.1.4. Binary file mode revisited

Because this script aims to be portable to Windows, it also takes care to open files in binary mode, even though they contain text data. As we've seen, when files are opened in text mode on Windows, \r is stripped from \r\n markers on input, and \r is added before \n markers on output. This automatic conversion allows scripts to represent the end-of-line marker as \n on all platforms. Here, though, it would also mean that the script would never see the \r it's looking for to detect a DOS-encoded line because the \r would be dropped before it ever reached the script:

    >>> open('temp.txt', 'w').writelines(['aaa\n', 'bbb\n'])
    >>> open('temp.txt', 'rb').read()
    'aaa\r\nbbb\r\n'
    >>> open('temp.txt', 'r').read()
    'aaa\nbbb\n'

Without binary open mode, this can lead to fairly subtle and incorrect behavior on Windows. For example, if files are opened in text mode, converting in todos mode on Windows would actually produce double \r characters: the script might convert the stripped \n to \r\n, which is then expanded on output to \r\r\n!

    >>> open('temp.txt', 'w').writelines(['aaa\r\n', 'bbb\r\n'])
    >>> open('temp.txt', 'rb').read()
    'aaa\r\r\nbbb\r\r\n'

With binary mode, the script inputs a full \r\n, so no conversion is performed.
Binary mode is also required for output on Windows in order to suppress the insertion of \r characters; without it, the tounix conversion would fail on that platform.[*]
If all that is too subtle to bear, just remember to use the b in file open mode strings if your scripts might be run on Windows and you mean to process either true binary data or text data as it is actually stored in the file.
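The effect of binary mode can be checked in a way that does not depend on the platform. The following is a small sketch assuming Python 3, where binary files hold bytes and the open call also accepts a newline argument that turns off line-end translation in text mode; the temporary path is only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with open(path, 'wb') as f:                   # binary: bytes written exactly as given
    f.write(b'aaa\r\nbbb\r\n')

with open(path, 'rb') as f:                   # binary: bytes read exactly as stored
    raw = f.read()
print(raw)                                    # b'aaa\r\nbbb\r\n' on any platform

with open(path, 'r', newline='') as f:        # text mode with translation disabled
    text = f.read()
print(repr(text))                             # 'aaa\r\nbbb\r\n'
```

Either form lets a converter see the \r characters that are really in the file, which is what the detection logic depends on.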
7.2.2. Converting Line Ends in One Directory

Armed with a fully debugged single file converter, it's an easy step to add support for converting all files in a single directory. Simply call the single file converter on every filename returned by a directory listing tool. The script in Example 7-3 uses the glob module we met in Chapter 4 to grab a list of files to convert.

Example 7-3. PP3E\PyTools\fixeoln_dir.py
This module defines a list, patts, containing filename patterns that match all the kinds of text files that appear in the book examples tree; each pattern is passed to the built-in glob.glob call by map to be separately expanded into a list of matching files. That's why there are nested for loops near the end. The outer loop steps through each glob result list, and the inner steps through each name within each list. Try the map call interactively if this doesn't make sense:

    >>> import glob
    >>> map(glob.glob, ['*.py', '*.html'])
    [['helloshell.py'], ['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']]

This script requires a convert mode flag on the command line and assumes that it is run in the directory where files to be converted live; cd to the directory to be converted before running this script (or change it to accept a directory name argument too):

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix
    Changing Launcher.py
    Changing Launch_PyGadgets.py
    Changing LaunchBrowser.py
    ...lines deleted...
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing README-PP3E.txt
    Visited 21 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos
    Changing Launcher.py
    Changing Launch_PyGadgets.py
    Changing LaunchBrowser.py
    ...lines deleted...
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing README-PP3E.txt
    Visited 21 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos     # makes no changes
    Visited 21 files

    C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
    Comparing files PyDemos.pyw and C:\PP3rdEd\examples\PP3E\PyDemos.pyw
    FC: no differences encountered

Notice that the third command generated no "Changing" messages again. Because the convertEndlines function of the single-file module is reused here to perform the actual updates, this script inherits that function's repeatability: it's OK to rerun this script on the same directory any number of times. Only lines that require conversion will be converted.
This script also accepts an optional list of filename patterns on the command line in order to override the default patts list of files to be changed:

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
    Changing echoEnvironment.pyw
    Changing Launch_PyDemos.pyw
    Changing Launch_PyGadgets_bar.pyw
    Changing PyDemos.pyw
    Changing PyGadgets_bar.pyw
    Changing cleanall.csh
    Changing makeall.csh
    Changing package.csh
    Changing setup-pp.csh
    Changing setup-pp-embed.csh
    Changing xferall.linux.csh
    Visited 11 files

    C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
    Visited 11 files

Also notice that the single-file script's convertEndlines function performs an initial os.path.isfile test to make sure the passed-in filename represents a file, not a directory; when we start globbing with patterns to collect files to convert, it's not impossible that a pattern's expansion might include the name of a directory along with the desired files.
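The directory-level loop can be sketched as follows: expand each pattern with glob.glob, skip anything that is not a regular file, and rewrite only files whose bytes actually change. This is a minimal Python 3 illustration under stated assumptions, not the book's fixeoln_dir.py; DEFAULT_PATTERNS, fix_bytes, and convert_dir are hypothetical names, and the converter here works on the whole file's bytes rather than line by line.

```python
import glob
import os

DEFAULT_PATTERNS = ['*.py', '*.pyw', '*.txt', '*.html']   # hypothetical defaults

def fix_bytes(mode, data):
    """Return data with all line ends in the target format (repeatable)."""
    unix = data.replace(b'\r\n', b'\n')
    return unix if mode == 'tounix' else unix.replace(b'\n', b'\r\n')

def convert_dir(mode, patterns=DEFAULT_PATTERNS, directory='.'):
    """Convert matching files in one directory; return the count of files visited."""
    visited = 0
    for pattern in patterns:                  # outer loop: one glob per pattern
        for path in glob.glob(os.path.join(directory, pattern)):
            if not os.path.isfile(path):      # a pattern may also match a directory
                continue
            visited += 1
            with open(path, 'rb') as f:
                data = f.read()
            fixed = fix_bytes(mode, data)
            if fixed != data:                 # rewrite only when something changed
                print('Changing', path)
                with open(path, 'wb') as f:
                    f.write(fixed)
    return visited
```

Passing an explicit list such as ['*.pyw', '*.csh'] for patterns plays the role of the command-line override shown above; a second run over the same directory visits the same files but prints no "Changing" lines.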
7.2.3. Converting Line Ends in an Entire Tree

Finally, Example 7-4 applies what we've already learned to an entire directory tree. It simply runs the file-converter function on every filename produced by tree-walking logic. In fact, this script really just orchestrates calls to the original and already debugged convertEndlines function.

Example 7-4. PP3E\PyTools\fixeoln_all.py
On Windows, the script uses the portable find.find tool we built in Chapter 4 (the hand-rolled equivalent of Python's original find module)[*] to generate a list of all matching file and directory names in the tree; on other platforms, it resorts to spawning a less portable and perhaps slower find shell command just for illustration purposes.
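A portable tree search along these lines can also be written with the standard library's os.walk and fnmatch. The following is a minimal sketch assuming Python 3; it is a stand-in for illustration rather than the find.find tool from Chapter 4, the name find_files is hypothetical, and unlike the tool described above it yields files only, not directory names.

```python
import fnmatch
import os

def find_files(pattern='*', root='.'):
    """Yield paths of files under root whose base name matches pattern."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):        # sorted for a stable order per directory
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)

if __name__ == '__main__':
    for path in find_files('*.py'):
        print(path)
```

Each path it produces could then be handed to a single-file converter in turn, which is the same division of labor the tree script uses.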
Once the file pathname lists are compiled, this script simply converts each found file in turn using the single-file converter module's tools. Here is the collection of scripts at work converting the book examples tree on Windows; notice that this script also processes the current working directory (CWD; cd to the directory to be converted before typing the command line), and that Python treats forward and backward slashes the same way in the program filename:

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py tounix
    Using Python find
    Changing .\LaunchBrowser.py
    Changing .\Launch_PyGadgets.py
    Changing .\Launcher.py
    Changing .\Other\cgimail.py
    ...lots of lines deleted...
    Changing .\EmbExt\Exports\ClassAndMod\output.prog1
    Changing .\EmbExt\Exports\output.prog1
    Changing .\EmbExt\Regist\output
    Visited 1051 files

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
    Using Python find
    Changing .\LaunchBrowser.py
    Changing .\Launch_PyGadgets.py
    Changing .\Launcher.py
    Changing .\Other\cgimail.py
    ...lots of lines deleted...
    Changing .\EmbExt\Exports\ClassAndMod\output.prog1
    Changing .\EmbExt\Exports\output.prog1
    Changing .\EmbExt\Regist\output
    Visited 1051 files

    C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos
    Using Python find
    Not a text file .\Embed\Inventory\Output
    Not a text file .\Embed\Inventory\WithDbase\Output
    Visited 1051 files

7.2.3.1. The view from the top

This script and its ancestors are shipped in the book's example distribution as that portable converter tool I was looking for. To convert all example files in the tree to Unix line-terminator format, simply copy the entire PP3E examples tree to some "examples" directory on your hard drive and type these two commands in a shell:

    cd examples/PP3E
    python PyTools/fixeoln_all.py tounix

Of course, this assumes Python is already installed (see the example distribution's README file for details) but will work on almost every platform in use today.
To convert back to DOS, just replace tounix with todos and rerun. I ship this tool with a training CD for Python classes I teach too; to convert those files, we simply type:

    cd Html\Examples
    python ..\..\Tools\fixeoln_all.py tounix

Once you get accustomed to the command lines, you can use this in all sorts of contexts. Finally, to make the conversion easier for beginners to run, the top-level examples directory includes tounix.py and todos.py scripts that can be simply double-clicked in a file explorer GUI; Example 7-5 shows the tounix converter.

Example 7-5. PP3E\tounix.py
This script addresses the end user's perception of usability, but other factors impact programmer usability, which is just as important to systems that will be read or changed by others. For example, the file, directory, and tree converters are coded in separate script files, but there is no law against combining them into a single program that relies on a command-line arguments pattern to know which of the three modes to run. The first argument could be a mode flag, tested by such a program:

    if mode == '-one':
        ...
    elif mode == '-dir':
        ...
    elif mode == '-all':
        ...

That seems more confusing than separate files per mode, though; it's usually much easier to botch a complex command line than to type a specific program file's name. It will also make for a confusing mix of global names and one very big piece of code at the bottom of the file. As always, simpler is usually better.