
7.6. Copying Directory Trees

The next three sections conclude this chapter by exploring a handful of additional utilities for processing directories (a.k.a. folders) on your computer with Python. They present directory copy, deletion, and comparison scripts that demonstrate system tools at work. All of these were born of necessity, are generally portable among all Python platforms, and illustrate Python development concepts along the way.

Some of these scripts do something too specialized for the visitor module's classes we've been applying in earlier sections of this chapter, and so require more custom solutions (e.g., we can't remove directories we intend to walk through). Most have platform-specific equivalents too (e.g., drag-and-drop copies), but the Python utilities shown here are portable, easily customized, callable from other scripts, and surprisingly fast.

7.6.1. A Python Tree Copy Script

My CD writer sometimes does weird things. In fact, copies of files with odd names can be totally botched on the CD, even though other files show up in one piece. That's not necessarily a showstopper; if just a few files are trashed in a big CD backup copy, I can always copy the offending files to floppies one at a time. Unfortunately, Windows drag-and-drop copies don't play nicely with such a CD: the copy operation stops and exits the moment the first bad file is encountered. You get only as many files as were copied up to the error, but no more.

In fact, this is not limited to CD copies. I've run into similar problems when trying to back up my laptop's hard drive to another drive: the drag-and-drop copy stops with an error as soon as it reaches a file with a name that is too long to copy (common in saved web pages). The last 45 minutes spent copying is wasted time; frustrating, to say the least!

There may be some magical Windows setting to work around this feature, but I gave up hunting for one as soon as I realized that it would be easier to code a copier in Python. The cpall.py script in Example 7-25 is one way to do it. With this script, I control what happens when bad files are found; I can skip over them with Python exception handlers, for instance. Moreover, this tool works with the same interface and effect on other platforms. It seems to me, at least, that a few minutes spent writing a portable and reusable Python script to meet a need is a better investment than looking for solutions that work on only one platform (if at all).

Example 7-25. PP3E\System\Filetools\cpall.py

    ############################################################################
    # Usage: "python cpall.py dirFrom dirTo".
    # Recursive copy of a directory tree.  Works like a "cp -r dirFrom/* dirTo"
    # Unix command, and assumes that dirFrom and dirTo are both directories.
    # Was written to get around fatal error messages under Windows drag-and-drop
    # copies (the first bad file ends the entire copy operation immediately),
    # but also allows for coding customized copy operations.  May need to
    # do more file type checking on Unix: skip links, fifos, etc.
    ############################################################################

    import os, sys
    verbose = 0
    dcount = fcount = 0
    maxfileload = 500000
    blksize = 1024 * 100

    def cpfile(pathFrom, pathTo, maxfileload=maxfileload):
        """
        copy file pathFrom to pathTo, byte for byte
        """
        if os.path.getsize(pathFrom) <= maxfileload:
            bytesFrom = open(pathFrom, 'rb').read()       # read small file all at once
            open(pathTo, 'wb').write(bytesFrom)           # need b mode on Windows
        else:
            fileFrom = open(pathFrom, 'rb')               # read big files in chunks
            fileTo   = open(pathTo,   'wb')               # need b mode here too
            while 1:
                bytesFrom = fileFrom.read(blksize)        # get one block, less at end
                if not bytesFrom: break                   # empty after last chunk
                fileTo.write(bytesFrom)

    def cpall(dirFrom, dirTo):
        """
        copy contents of dirFrom and below to dirTo
        """
        global dcount, fcount
        for file in os.listdir(dirFrom):                      # for files/dirs here
            pathFrom = os.path.join(dirFrom, file)
            pathTo   = os.path.join(dirTo,   file)            # extend both paths
            if not os.path.isdir(pathFrom):                   # copy simple files
                try:
                    if verbose > 1: print 'copying', pathFrom, 'to', pathTo
                    cpfile(pathFrom, pathTo)
                    fcount = fcount+1
                except:
                    print 'Error copying', pathFrom, 'to', pathTo, '--skipped'
                    print sys.exc_info()[0], sys.exc_info()[1]
            else:
                if verbose: print 'copying dir', pathFrom, 'to', pathTo
                try:
                    os.mkdir(pathTo)                          # make new subdir
                    cpall(pathFrom, pathTo)                   # recur into subdirs
                    dcount = dcount+1
                except:
                    print 'Error creating', pathTo, '--skipped'
                    print sys.exc_info()[0], sys.exc_info()[1]

    def getargs():
        try:
            dirFrom, dirTo = sys.argv[1:]
        except:
            print 'Use: cpall.py dirFrom dirTo'
        else:
            if not os.path.isdir(dirFrom):
                print 'Error: dirFrom is not a directory'
            elif not os.path.exists(dirTo):
                os.mkdir(dirTo)
                print 'Note: dirTo was created'
                return (dirFrom, dirTo)
            else:
                print 'Warning: dirTo already exists'
                if dirFrom == dirTo or (hasattr(os.path, 'samefile') and
                                        os.path.samefile(dirFrom, dirTo)):
                    print 'Error: dirFrom same as dirTo'
                else:
                    return (dirFrom, dirTo)

    if __name__ == '__main__':
        import time
        dirstuple = getargs()
        if dirstuple:
            print 'Copying...'
            start = time.time()
            cpall(*dirstuple)
            print 'Copied', fcount, 'files,', dcount, 'directories',
            print 'in', time.time() - start, 'seconds'

This script implements its own recursive tree traversal logic and keeps track of both the "from" and "to" directory paths as it goes. At every level, it copies over simple files, creates directories in the "to" path, and recurs into subdirectories with "from" and "to" paths extended by one level. There are other ways to code this task (e.g., other cpall variants in the book's examples distribution change the working directory along the way with os.chdir calls), but extending paths on descent works well in practice.

Notice this script's reusable cpfile function: just in case there are multigigabyte files in the tree to be copied, it uses a file's size to decide whether it should be read all at once or in chunks (remember, the file read method without arguments actually loads the entire file into an in-memory string). We choose fairly large file and block sizes, because the more we read at once in Python, the faster our scripts will typically run. This is more efficient than it may sound; strings left behind by prior reads will be garbage collected and reused as we go.
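For readers on a current Python, here is a minimal sketch of the same size-based strategy in Python 3 syntax. It is not the book's code: the name copy_file, the explicit with blocks, and the byte-count return value are my own additions for illustration, while the two size constants are taken from the listing above.

```python
import os

MAX_FILE_LOAD = 500000      # files up to this size are read in one call
BLOCK_SIZE = 1024 * 100     # larger files are read in blocks of this size

def copy_file(path_from, path_to,
              max_file_load=MAX_FILE_LOAD, block_size=BLOCK_SIZE):
    """Copy path_from to path_to byte for byte; return the bytes written."""
    copied = 0
    # 'with' closes both files even if a read or write raises
    with open(path_from, 'rb') as file_from, open(path_to, 'wb') as file_to:
        if os.path.getsize(path_from) <= max_file_load:
            data = file_from.read()              # small file: load all at once
            file_to.write(data)
            copied = len(data)
        else:
            while True:
                block = file_from.read(block_size)   # one block, less at end
                if not block:                        # empty after last block
                    break
                file_to.write(block)
                copied += len(block)
    return copied
```

Passing a small max_file_load forces the chunked branch, which is a handy way to exercise both paths on an ordinary file.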

Also note that this script creates the "to" directory if needed, but it assumes that the directory is empty when a copy starts up; be sure to remove the target directory before copying a new tree to its name (more on this in the next section).
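If you would rather clear the target from Python than from a shell, the standard library's shutil.rmtree can do it. The following is only a sketch in Python 3 syntax; prepare_target is a hypothetical helper name, not part of cpall.py, and it deletes everything below the path it is given, so use it with care.

```python
import os
import shutil

def prepare_target(dir_to):
    """Remove dir_to and everything below it if present, then recreate it empty."""
    if os.path.isdir(dir_to):
        shutil.rmtree(dir_to)       # destructive: deletes the whole subtree
    os.mkdir(dir_to)
    return dir_to
```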

Here is a big book examples tree copy in action on Windows; pass in the name of the "from" and "to" directories to kick off the process, redirect the output to a file if there are too many error messages to read all at once (e.g., > output.txt), and run an rm shell command (or similar platform-specific tool) to delete the target directory first if needed:

    C:\temp>rm -rf cpexamples

    C:\temp>python %X%\system\filetools\cpall.py examples cpexamples
    Note: dirTo was created
    Copying...
    Copied 1356 files, 118 directories in 2.41999995708 seconds

    C:\temp>fc /B examples\System\Filetools\cpall.py
                  cpexamples\System\Filetools\cpall.py
    Comparing files examples\System\Filetools\cpall.py and cpexamples\System\Filetools\cpall.py
    FC: no differences encountered

At the time I wrote this example in 2000, this test run copied a tree of 1,356 files and 118 directories in 2.4 seconds on my 650 MHz Windows 98 laptop (the built-in time.time call can be used to query the system time in seconds). It runs a bit slower if some other programs are open on the machine, and may run arbitrarily faster or slower for you. Still, this is at least as fast as the best drag-and-drop I've timed on Windows.
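The byte-for-byte check that fc /B performs in the session above can also be done portably from Python with the standard filecmp module. This is a small sketch in Python 3 syntax; same_bytes is a name I made up for the example.

```python
import filecmp

def same_bytes(path_a, path_b):
    """Return True if the two files have identical contents."""
    # shallow=False makes filecmp compare file contents rather than
    # trusting os.stat signatures (type, size, modification time) alone
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For instance, same_bytes could be called on the original cpall.py and its copy in cpexamples after a run.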

So how does this script work around bad files on a CD backup? The secret is that it catches and ignores file exceptions, and it keeps walking. To copy all the files that are good on a CD, I simply run a command line such as this one:

    C:\temp>python %X%\system\filetools\cpall.py
                                g:\PP3rdEd\examples\PP3E cpexamples

Because the CD is addressed as "G:" on my Windows machine, this is the command-line equivalent of drag-and-drop copying from an item in the CD's top-level folder, except that the Python script will recover from errors on the CD and get the rest. On copy errors, it prints a message to standard output and continues; for big copies, you'll probably want to redirect the script's output to a file for later inspection.
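To make the catch-and-continue idea concrete, here is a compact sketch in Python 3 syntax. It is an illustration rather than a port of cpall.py: it swaps in the standard shutil.copyfile for the book's cpfile, collects failures in a list instead of printing them, and the function name copy_tree_skipping_errors is hypothetical. It assumes dir_to already exists.

```python
import os
import shutil

def copy_tree_skipping_errors(dir_from, dir_to):
    """Copy dir_from's contents into the existing dir_to.

    Returns (files_copied, dirs_copied, errors), where errors is a
    list of (path, exception) pairs for items that were skipped.
    """
    files = dirs = 0
    errors = []
    for name in os.listdir(dir_from):
        path_from = os.path.join(dir_from, name)
        path_to = os.path.join(dir_to, name)
        try:
            if os.path.isdir(path_from):
                os.mkdir(path_to)                     # make new subdir
                sub = copy_tree_skipping_errors(path_from, path_to)
                files += sub[0]
                dirs += sub[1] + 1                    # count this subdir too
                errors.extend(sub[2])
            else:
                shutil.copyfile(path_from, path_to)   # copy a simple file
                files += 1
        except OSError as exc:
            # record the failure and keep walking instead of aborting
            errors.append((path_from, exc))
    return files, dirs, errors
```

Because failures come back as data, a caller can write them to a file for later inspection once the copy has finished.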

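In general, cpall can be passed any absolute directory path on your machine, even those that indicate devices such as CDs. To make this go on Linux, pass the directory where your CD is mounted (a mount point such as /mnt/cdrom, or something similar) rather than a raw device file such as /dev/cdrom, because cpall expects its "from" argument to be a directory.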

7.6.2. Recoding Copies with a Visitor-Based Class

When I first wrote the cpall script just discussed, I couldn't see a way that the visitor class hierarchy we met earlier would help. Two directories needed to be traversed in parallel (the original and the copy), and visitor is based on climbing one tree with os.path.walk. There seemed no easy way to keep track of where the script was in the copy directory.

The trick I eventually stumbled onto is not to keep track at all. Instead, the script in Example 7-26 simply replaces the "from" directory path string with the "to" directory path string, at the front of all directory names and pathnames passed in from os.path.walk. The results of the string replacements are the paths to which the original files and directories are to be copied.
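The path-replacement trick is easy to try in isolation. The sketch below uses Python 3 syntax and os.walk (os.path.walk was removed in Python 3), so it is an adaptation, not the book's visitor code; map_to_target and planned_copies are hypothetical names. Like the slicing in the visitor-based script, it assumes from_dir is passed without a trailing path separator.

```python
import os

def map_to_target(path, from_dir, to_dir):
    """Swap the from_dir prefix at the front of path for to_dir."""
    relative = path[len(from_dir) + 1:]     # drop prefix plus one separator
    return os.path.join(to_dir, relative)

def planned_copies(from_dir, to_dir):
    """Yield (source, target) pairs for every directory and file under from_dir."""
    for dirpath, dirnames, filenames in os.walk(from_dir):
        yield dirpath, map_to_target(dirpath, from_dir, to_dir)
        for name in filenames:
            source = os.path.join(dirpath, name)
            yield source, map_to_target(source, from_dir, to_dir)
```

Nothing is copied here; the generator only reports where each item would go, which makes the mapping easy to inspect before committing to a real copy.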

Example 7-26. PP3E\System\Filetools\cpall_visitor.py

    ###########################################################
    # Use: "python cpall_visitor.py fromDir toDir"
    # cpall, but with the visitor classes and os.path.walk;
    # the trick is to do string replacement of fromDir with
    # toDir at the front of all the names walk passes in;
    # assumes that the toDir does not exist initially;
    ###########################################################

    import os
    from PP3E.PyTools.visitor import FileVisitor
    from cpall import cpfile, getargs
    verbose = True

    class CpallVisitor(FileVisitor):
        def __init__(self, fromDir, toDir):
            self.fromDirLen = len(fromDir) + 1
            self.toDir      = toDir
            FileVisitor.__init__(self)
        def visitdir(self, dirpath):
            toPath = os.path.join(self.toDir, dirpath[self.fromDirLen:])
            if verbose: print 'd', dirpath, '=>', toPath
            os.mkdir(toPath)
            self.dcount += 1
        def visitfile(self, filepath):
            toPath = os.path.join(self.toDir, filepath[self.fromDirLen:])
            if verbose: print 'f', filepath, '=>', toPath
            cpfile(filepath, toPath)
            self.fcount += 1

    if __name__ == '__main__':
        import sys, time
        fromDir, toDir = sys.argv[1:3]
        if len(sys.argv) > 3: verbose = 0
        print 'Copying...'
        start = time.time()
        walker = CpallVisitor(fromDir, toDir)
        walker.run(startDir=fromDir)
        print 'Copied', walker.fcount, 'files,', walker.dcount, 'directories',
        print 'in', time.time() - start, 'seconds'

This version accomplishes roughly the same goal as the original, but it has made a few assumptions to keep code simple. The "to" directory is assumed not to exist initially, and exceptions are not ignored along the way. Here it is copying the book examples tree again on Windows:

    C:\temp>rm -rf cpexamples

    C:\temp>python %X%\system\filetools\cpall_visitor.py
                                               examples cpexamples -quiet
    Copying...
    Copied 1356 files, 119 directories in 2.09000003338 seconds

    C:\temp>fc /B examples\System\Filetools\cpall.py
                  cpexamples\System\Filetools\cpall.py
    Comparing files examples\System\Filetools\cpall.py and cpexamples\System\Filetools\cpall.py
    FC: no differences encountered

Despite the extra string slicing going on, this version runs just as fast as the original. For tracing purposes, this version also prints all the "from" and "to" copy paths during the traversal unless you pass in a third argument on the command line or set the script's verbose variable to False or 0:

    C:\temp>python %X%\system\filetools\cpall_visitor.py examples cpexamples
    Copying...
    d examples => cpexamples\
    f examples\autoexec.bat => cpexamples\autoexec.bat
    f examples\cleanall.csh => cpexamples\cleanall.csh
     ...more deleted...
    d examples\System => cpexamples\System
    f examples\System\System.txt => cpexamples\System\System.txt
    f examples\System\more.py => cpexamples\System\more.py
    f examples\System\reader.py => cpexamples\System\reader.py
     ...more deleted...
    Copied 1356 files, 119 directories in 2.31000006199 seconds




Programming Python
ISBN: 0596009259
Year: 2004
Pages: 270
Authors: Mark Lutz
