7.5. Visitor: Walking Trees Generically

Armed with the portable search_all script from Example 7-10, I was able to better pinpoint files to be edited every time I changed the book examples tree structure. At least initially, in one window I ran search_all to pick out suspicious files and edited each along the way by hand in another window. Pretty soon, though, this became tedious too. Manually typing filenames into editor commands is no fun, especially when the number of files to edit is large; the search for "Part2" shown earlier returned 74 files, for instance. Since I occasionally have better things to do than manually start 74 editor sessions, I looked for a way to automatically run an editor on each suspicious file.

Unfortunately, search_all simply prints results to the screen. Although that text could be intercepted and parsed, a more direct approach that spawns edit sessions during the search may be easier, but may require major changes to the tree search script as currently coded. At this point, two thoughts came to mind.

First, I knew it would be easier in the long run to be able to add features to a general directory searcher as external components, not by changing the original script. Because editing files was just one possible extension (what about automating text replacements too?), a more generic, customizable, and reusable search component seemed the way to go.

Second, after writing a few directory walking utilities, it became clear that I was rewriting the same sort of code over and over again. Traversals could be even further simplified by wrapping common details for easier reuse. The os.path.walk tool helps, but its use tends to foster redundant operations (e.g., directory name joins), and its function-object-based interface doesn't quite lend itself to customization the way a class can.

Of course, both goals point to using an object-oriented framework for traversals and searching. Example 7-11 is one concrete realization of these goals.
It exports a general FileVisitor class that mostly just wraps os.path.walk for easier use and extension, as well as a generic SearchVisitor class that generalizes the notion of directory searches. By itself, SearchVisitor simply does what search_all did, but it also opens up the search process to customization; bits of its behavior can be modified by overloading its methods in subclasses. Moreover, its core search logic can be reused everywhere we need to search. Simply define a subclass that adds search-specific extensions. As is usual in programming, once you repeat tactical tasks often enough, they tend to inspire this kind of strategic thinking. Example 7-11. PP3E\PyTools\visitor.py
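The listing itself is not reproduced here, so the following is only a rough sketch of the two classes described above, not the book's code. It assumes Python 3, where os.path.walk no longer exists, and so wraps os.walk instead; the method names visitdir, visitfile, and visitmatch follow the text, while the verbose flag, the skipexts list, and the counter names are assumptions made for illustration.

```python
import os

class FileVisitor:
    """Visit every directory and file at and below a root (sketch)."""
    def __init__(self, verbose=True):
        self.fcount = 0                  # files visited so far
        self.dcount = 0                  # directories visited so far
        self.verbose = verbose           # assumption: flag to silence status lines

    def run(self, startdir=os.curdir):
        # os.walk yields (dirpath, subdirs, files); names are joined once, here
        for dirpath, subdirs, files in os.walk(startdir):
            self.visitdir(dirpath)
            for name in files:
                self.visitfile(os.path.join(dirpath, name))

    def visitdir(self, dirpath):         # override in subclasses
        self.dcount += 1
        if self.verbose:
            print(dirpath, '...')

    def visitfile(self, filepath):       # override in subclasses
        self.fcount += 1
        if self.verbose:
            print(self.fcount, '=>', filepath)


class SearchVisitor(FileVisitor):
    """Report files whose text contains a search key (sketch)."""
    skipexts = ('.pyc', '.gif', '.o', '.a', '.exe')   # hypothetical skip list

    def __init__(self, key, verbose=True):
        super().__init__(verbose)
        self.key = key
        self.scount = 0                  # files that contain the key

    def visitfile(self, filepath):
        super().visitfile(filepath)
        if os.path.splitext(filepath)[1].lower() in self.skipexts:
            if self.verbose:
                print('Skipping', filepath)
            return
        try:
            with open(filepath, encoding='utf-8', errors='ignore') as f:
                text = f.read()
        except OSError:
            return                       # unreadable file: move on
        if self.key in text:
            self.scount += 1
            self.visitmatch(filepath, text)

    def visitmatch(self, filepath, text):   # override in subclasses
        print('%s has %s' % (filepath, self.key))
```

Under these assumptions, SearchVisitor('Part3').run('.') would list each file and report the ones that contain the key; subclasses change behavior by overriding just one of the visit methods.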
This module primarily serves to export classes for external use, but it does something useful when run standalone too. If you invoke it as a script with a single argument, 1, it makes and runs a FileVisitor object and prints an exhaustive listing of every file and directory at and below the place you are at when the script is invoked (i.e., ".", the current working directory):

    C:\temp>python %X%\PyTools\visitor.py 1
    . ...
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    5 => .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    ...more deleted...
    479 => .\Gui\Clock\plotterGui.py
    480 => .\Gui\Clock\plotterText.py
    481 => .\Gui\Clock\plotterText1.py
    482 => .\Gui\Clock\__init__.py
    .\Gui\gifs ...
    483 => .\Gui\gifs\frank.gif
    484 => .\Gui\gifs\frank.note
    485 => .\Gui\gifs\gilligan.gif
    486 => .\Gui\gifs\gilligan.note
    ...more deleted...
    1352 => .\PyTools\visitor_fixnames.py
    1353 => .\PyTools\visitor_find_quiet2.py
    1354 => .\PyTools\visitor_find.pyc
    1355 => .\PyTools\visitor_find_quiet1.py
    1356 => .\PyTools\fixeoln_one.doc.txt
    Visited 1356 files and 119 dirs

If you instead invoke this script with a 2 as its first argument, it makes and runs a SearchVisitor object using the second argument as the search key. This form is equivalent to running the search_all.py script we met earlier; it pauses for an Enter key press after each matching file is reported (the "has Part3" lines here):

    C:\temp\examples>python %X%\PyTools\visitor.py 2 Part3
    . ...
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    .\cleanall.csh has Part3
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    .\Launcher.py has Part3
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    8 => .\LaunchBrowser.out.txt
    9 => .\LaunchBrowser.py
    10 => .\Launch_PyGadgets_bar.pyw
    11 => .\makeall.csh
    .\makeall.csh has Part3
    ...more deleted...
    1353 => .\PyTools\visitor_find_quiet2.py
    1354 => .\PyTools\visitor_find.pyc
    Skipping .\PyTools\visitor_find.pyc
    1355 => .\PyTools\visitor_find_quiet1.py
    1356 => .\PyTools\fixeoln_one.doc.txt
    Found in 49 files, visited 1356

Technically, passing this script a first argument of 3 runs both a FileVisitor and a SearchVisitor (two separate traversals are performed). The first argument is really used as a bit mask to select one or more supported self-tests; if a test's bit is on in the binary value of the argument, the test will be run. Because 3 is 011 in binary, it selects both a search (010) and a listing (001). In a more user-friendly system, we might want to be more symbolic about that (e.g., check for -search and -list arguments), but bit masks work just as well for this script's scope.
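The bit-mask selection described above can be sketched in a few lines. The constant and function names here are hypothetical; only the bit values (001 for a listing, 010 for a search) come from the text.

```python
LIST_BIT = 1      # binary 001: run the FileVisitor listing
SEARCH_BIT = 2    # binary 010: run the SearchVisitor search

def selected_tests(mask):
    """Return the names of the self-tests whose bit is on in mask."""
    tests = []
    if mask & LIST_BIT:
        tests.append('list')
    if mask & SEARCH_BIT:
        tests.append('search')
    return tests
```

With this shape, an argument of 3 selects both tests because 3 & 1 and 3 & 2 are both nonzero, while 1 or 2 selects just one.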
7.5.1. Editing Files in Directory Trees

Now, after genericizing tree traversals and searches, it's an easy step to add automatic file editing in a brand-new, separate component. Example 7-12 defines a new EditVisitor class that simply customizes the visitmatch method of the SearchVisitor class to open a text editor on the matched file. Yes, this is the complete program. It needs to do something special only when visiting matched files, and so it needs to provide only that behavior; the rest of the traversal and search logic is unchanged and inherited.

Example 7-12. PP3E\PyTools\visitor_edit.py
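As a hedged illustration of how little such a subclass needs, here is a sketch rather than the original listing. It is written for Python 3, and the SearchVisitor below is a minimal stand-in so that the block runs on its own. The editor attribute and the spawn parameter are assumptions added for illustration; spawn defaults to os.system, which normally blocks until the spawned command exits.

```python
import os

class SearchVisitor:
    """Minimal stand-in for the search class, so this sketch is self-contained."""
    def __init__(self, key):
        self.key = key
        self.scount = 0

    def run(self, startdir=os.curdir):
        for dirpath, subdirs, files in os.walk(startdir):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding='utf-8', errors='ignore') as f:
                        text = f.read()
                except OSError:
                    continue                  # unreadable file: skip it
                if self.key in text:
                    self.scount += 1
                    self.visitmatch(path, text)

    def visitmatch(self, path, text):
        print('%s has %s' % (path, self.key))


class EditVisitor(SearchVisitor):
    """Open an editor on each matched file instead of printing a message."""
    editor = 'vi'                             # assumption: any console editor command

    def __init__(self, key, spawn=os.system):
        super().__init__(key)
        self.spawn = spawn                    # callable that runs a command line

    def visitmatch(self, path, text):
        # Only this step differs; the walk and the search are inherited
        self.spawn('%s "%s"' % (self.editor, path))
```

Passing a different spawn callable (a list's append method, say) makes it possible to see which commands would be run without starting any editor.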
When we make and run an EditVisitor, a text editor is started with the os.system command-line spawn call, which usually blocks its caller until the spawned program finishes. On my machines, each time this script finds a matched file during the traversal, it starts up the vi text editor within the console window where the script was started; exiting the editor resumes the tree walk.

Let's find and edit some files. When run as a script, we pass this program the search string as a command argument (here, the string -exec is the search key, not an option flag). The root directory is always passed to the run method as ".", the current run directory. Traversal status messages show up in the console as before, but each matched file now automatically pops up in a text editor along the way. Here, the editor is started eight times:

    C:\...\PP3E>python PyTools\visitor_edit.py -exec
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    ...more deleted...
    1340 => .\old_Part2\Basics\unpack2.py
    1341 => .\old_Part2\Basics\unpack2b.py
    1342 => .\old_Part2\Basics\unpack3.py
    1343 => .\old_Part2\Basics\__init__.py
    Edited 8 files, visited 1343

This, finally, is the exact tool I was looking for to simplify global book examples tree maintenance. After major changes to things such as shared modules and file and directory names, I run this script on the examples root directory with an appropriate search string and edit any files it pops up as needed. I still need to change files by hand in the editor, but that's often safer than blind global replacements.

7.5.2. Global Replacements in Directory Trees

But since I brought it up, given a general tree traversal class, it's easy to code a global search-and-replace subclass too.
The FileVisitor subclass in Example 7-13, ReplaceVisitor, customizes the visitfile method to globally replace any appearances of one string with another, in all text files at and below a root directory. It also collects the names of all files that were changed in a list just in case you wish to go through and verify the automatic edits applied (a text editor could be automatically popped up on each changed file, for instance). Example 7-13. PP3E\PyTools\visitor_replace.py
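The following is a sketch of the replacement idea in Python 3, not the book's listing. The FileVisitor here is a minimal stand-in, and the skipexts list and the changed attribute name are assumptions; the fromStr and toStr names echo the "empty toStr" hint in the book's own session output.

```python
import os

class FileVisitor:
    """Minimal stand-in for the base walker, so this sketch is self-contained."""
    def run(self, startdir=os.curdir):
        for dirpath, subdirs, files in os.walk(startdir):
            for name in files:
                self.visitfile(os.path.join(dirpath, name))


class ReplaceVisitor(FileVisitor):
    """Replace fromStr with toStr in every text file; remember changed files."""
    skipexts = ('.pyc', '.gif', '.exe')       # hypothetical list of binary types

    def __init__(self, fromStr, toStr):
        self.fromStr = fromStr
        self.toStr = toStr
        self.changed = []                     # paths of files that were rewritten

    def visitfile(self, path):
        if os.path.splitext(path)[1].lower() in self.skipexts:
            return
        try:
            # newline='' keeps each file's own line endings intact
            with open(path, encoding='utf-8', newline='') as f:
                text = f.read()
        except (OSError, UnicodeDecodeError):
            return                            # unreadable or non-text: leave alone
        if self.fromStr in text:
            with open(path, 'w', encoding='utf-8', newline='') as f:
                f.write(text.replace(self.fromStr, self.toStr))
            self.changed.append(path)
```

Files are rewritten only when the string is actually present, so the changed list doubles as a record of what to review afterward.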
To run this script over a directory tree, go to the directory to be changed and run the following sort of command line with "from" and "to" strings. On my current machine, doing this on a 1,354-file tree and changing 75 files along the way takes roughly six seconds of real clock time when the system isn't particularly busy.

    C:\temp\examples>python %X%/PyTools/visitor_replace.py Part2 SPAM2
    Are you sure?y
    . ...
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    ...more deleted...
    1351 => .\PyTools\visitor_find_quiet2.py
    1352 => .\PyTools\visitor_find.pyc
    Skipping .\PyTools\visitor_find.pyc
    1353 => .\PyTools\visitor_find_quiet1.py
    1354 => .\PyTools\fixeoln_one.doc.txt
    Visited 1354 files
    Changed 75 files:
    .\Launcher.py
    .\LaunchBrowser.out.txt
    .\LaunchBrowser.py
    .\PyDemos.pyw
    .\PyGadgets.py
    .\README-PP3E.txt
    ...more deleted...
    .\PyTools\search_all.out.txt
    .\PyTools\visitor.out.txt
    .\PyTools\visitor_edit.py

    [to delete, use an empty toStr]
    C:\temp\examples>python %X%/PyTools/visitor_replace.py SPAM ""

This is both wildly powerful and dangerous. If the string to be replaced can show up in places you didn't anticipate, you might just ruin an entire tree of files by running the ReplaceVisitor object defined here. On the other hand, if the string is something very specific, this object can obviate the need to edit suspicious files one by one in an editor. For instance, we will use this approach to automatically change web site addresses in HTML files in Chapter 16; the addresses are likely too specific to show up in other places by chance.

7.5.3. Collecting Matched Files in Trees

The scripts so far search and replace in directory trees, using the same traversal code base (the visitor module). Suppose, though, that you just want to get a Python list of files in a directory containing a string. You could run a search and parse the output messages for "found" messages.
Much simpler, simply knock off another SearchVisitor subclass to collect the list along the way, as in Example 7-14. Example 7-14. PP3E\PyTools\visitor_collect.py
CollectVisitor is just a tree search again, with a new kind of specialization: collecting files instead of printing messages. This class is useful from other scripts that mean to collect a matched files list for later processing; it can be run by itself as a script too:

    C:\...\PP3E>python PyTools\visitor_collect.py -exec
    ...more deleted...
    1342 => .\old_Part2\Basics\unpack2b.py
    1343 => .\old_Part2\Basics\unpack3.py
    1344 => .\old_Part2\Basics\__init__.py
    Found these files:
    .\package.csh
    .\README-PP3E.txt
    .\readme-old-pp1E.txt
    .\PyTools\cleanpyc.py
    .\PyTools\fixeoln_all.py
    .\System\Processes\output.txt
    .\Internet\Cgi-Web\fixcgi.py

7.5.3.1. Suppressing status messages

Here, the items in the collected list are displayed at the end: all the files containing the string -exec. Notice, though, that traversal status messages are still printed along the way (in fact, I deleted about 1,600 lines of such messages here!). In a tool meant to be called from another script, that may be an undesirable side effect; the calling script's output may be more important than the traversal's.

We could add mode flags to SearchVisitor to turn off status messages, but that makes it more complex. Instead, the following two files show how we might go about collecting matched filenames without letting any traversal messages show up in the console, all without changing the original code base. The first, shown in Example 7-15, simply takes over and copies the search logic, without print statements. It's a bit redundant with SearchVisitor, but only in a few lines of mimicked code.

Example 7-15. PP3E\PyTools\visitor_collect_quiet1.py
When this class is run, only the contents of the matched filenames list show up at the end; no status messages appear during the traversal. Because of that, this form may be more useful as a general-purpose tool used by other scripts:

    C:\...\PP3E>python PyTools\visitor_collect_quiet1.py -exec
    Found these files:
    .\package.csh
    .\README-PP3E.txt
    .\readme-old-pp1E.txt
    .\PyTools\cleanpyc.py
    .\PyTools\fixeoln_all.py
    .\System\Processes\output.txt
    .\Internet\Cgi-Web\fixcgi.py

A more interesting and less redundant way to suppress printed text during a traversal is to apply the stream redirection tricks we met in Chapter 3. Example 7-16 sets sys.stdout to a NullOut object that throws away all printed text for the duration of the traversal (its write method does nothing). We could also use the StringIO module we met in Chapter 3 for this purpose, but it's overkill here; we don't need to retain printed text.

The only real complication with this scheme is that there is no good place to insert a restoration of sys.stdout at the end of the traversal; instead, we code the restore in the __del__ destructor method and require clients to delete the visitor to resume printing as usual. An explicitly called method would work just as well, if you prefer less magical interfaces.

Example 7-16. PP3E\PyTools\visitor_collect_quiet2.py
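The discard-everything stream described above can be sketched as follows, in Python 3. This is not the book's listing: it swaps the __del__-based restore for a try/finally wrapper, which is the "explicitly called" flavor of the same idea. The NullOut name comes from the text; run_quietly and noisy are hypothetical helpers added for illustration.

```python
import sys

class NullOut:
    """File-like object whose write method throws its text away."""
    def write(self, text):
        pass

    def flush(self):
        pass


def run_quietly(func, *args):
    """Call func(*args) with sys.stdout set to a NullOut, then restore it."""
    saved = sys.stdout
    sys.stdout = NullOut()
    try:
        return func(*args)
    finally:
        sys.stdout = saved               # restored even if func raises


def noisy(n):
    """Stand-in for a traversal that prints status messages as it goes."""
    for i in range(n):
        print(i, '=> status message')
    return n
```

A call such as run_quietly(noisy, 3) returns the function's result while none of its printed lines reach the console.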
When this script is run, output is identical to the prior run: just the matched filenames at the end. Perhaps better still, why not code and debug just one verbose CollectVisitor utility class, and require clients to wrap calls to its run method in the redirect.redirect function we wrote in Example 3-10?

    >>> from PP3E.PyTools.visitor_collect import CollectVisitor
    >>> from PP3E.System.Streams.redirect import redirect
    >>> walker = CollectVisitor('-exec')               # object to find '-exec'
    >>> output = redirect(walker.run, ('.',), '')      # function, args, input
    >>> for line in walker.matches: print line         # print items in list
    ...
    .\package.csh
    .\README-PP3E.txt
    .\readme-old-pp1E.txt
    .\PyTools\cleanpyc.py
    .\PyTools\fixeoln_all.py
    .\System\Processes\output.txt
    .\Internet\Cgi-Web\fixcgi.py

The redirect call employed here resets standard input and output streams to file-like objects for the duration of any function call; because of that, it's a more general way to suppress output than recoding every outputter. Here, it has the effect of intercepting (and hence suppressing) printed messages during a walker.run('.') traversal. They really are printed, but show up in the string result of the redirect call, not on the screen:

    >>> output[:60]
    '. ...\n1 => .\\autoexec.bat\n2 => .\\cleanall.csh\n3 => .\\echoEnv'
    >>> len(output), len(output.split('\n'))           # bytes, lines
    (67609, 1592)
    >>> walker.matches
    ['.\\package.csh', '.\\README-PP3E.txt', '.\\readme-old-pp1E.txt',
    '.\\PyTools\\cleanpyc.py', '.\\PyTools\\fixeoln_all.py',
    '.\\System\\Processes\\output.txt', '.\\Internet\\Cgi-Web\\fixcgi.py']

Because redirect saves printed text in a string, it may be less appropriate than the two quiet CollectVisitor variants for functions that generate much output. Here, for example, 67,609 bytes of output were queued up in an in-memory string (see the len call results); whether a buffer of that size matters depends on the application.
In more general terms, redirecting sys.stdout to dummy objects as done here is a simple way to turn off outputs (and is equivalent to the Unix notion of redirecting output to the file /dev/null, a file that discards everything sent to it). For instance, we'll pull this trick out of the bag again in the context of server-side Internet scripting, to prevent utility status messages from showing up in generated web page output streams.
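For readers without Example 3-10 at hand, a function with the same call shape as the redirect used above (function, arguments tuple, input text) might look roughly like this in Python 3. It is a sketch under that assumption, not the book's code, and it relies only on the standard io.StringIO class.

```python
import io
import sys

def redirect(function, args, input_text=''):
    """Run function(*args) with stdin and stdout replaced by in-memory
    streams, and return everything the call printed as one string."""
    saved_in, saved_out = sys.stdin, sys.stdout
    sys.stdin = io.StringIO(input_text)      # reads come from input_text
    sys.stdout = io.StringIO()               # prints are collected here
    try:
        function(*args)
        return sys.stdout.getvalue()
    finally:
        sys.stdin, sys.stdout = saved_in, saved_out   # always restore
```

Unlike a stream that discards its text, this keeps every printed character in memory, which is the trade-off discussed above for functions that produce a lot of output.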
7.5.4. Recoding Fixers with Visitors

Be warned: once you've written and debugged a class that knows how to do something useful like walking directory trees, it's easy for it to spread throughout your system utility libraries. Of course, that's the whole point of code reuse. For instance, very soon after writing the visitor classes presented in the prior sections, I recoded both the fixnames_all.py and the fixeoln_all.py directory walker scripts listed earlier in Examples 7-6 and 7-4, respectively, to use visitor rather than proprietary tree-walk logic (they both originally used find.find). Example 7-17 combines the original convertLines function (to fix end-of-lines in a single file) with visitor's tree walker class, to yield an alternative implementation of the line-end converter for directory trees.

Example 7-17. PP3E\PyTools\visitor_fixeoln.py
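The file-selection half of such a script can be sketched with the standard fnmatch module, again in Python 3 with os.walk. The pattern list and the matching_files name are hypothetical, and the line-end conversion itself is left out; this is not the original listing.

```python
import fnmatch
import os

patterns = ['*.py', '*.pyw', '*.txt', '*.c', '*.h']   # hypothetical pattern list

def matching_files(startdir, patts=patterns):
    """Yield each file path below startdir whose name matches a pattern."""
    for dirpath, subdirs, files in os.walk(startdir):
        for name in files:
            for patt in patts:
                if fnmatch.fnmatch(name, patt):
                    yield os.path.join(dirpath, name)
                    break        # one report per file, even if more patterns match
```

The break after the first match keeps a file from being processed twice when its name satisfies more than one pattern.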
As we saw in Chapter 4, the built-in fnmatch module performs Unix shell-like filename matching; this script uses it to match names to the previous version's filename patterns (simply looking for filename extensions after a "." is simpler, but not as general):

    C:\temp\examples>python %X%/PyTools/visitor_fixeoln.py tounix
    . ...
    Changing .\echoEnvironment.pyw
    Changing .\Launcher.py
    Changing .\Launch_PyGadgets.py
    Changing .\Launch_PyDemos.pyw
    ...more deleted...
    Changing .\PyTools\visitor_find.py
    Changing .\PyTools\visitor_fixnames.py
    Changing .\PyTools\visitor_find_quiet2.py
    Changing .\PyTools\visitor_find_quiet1.py
    Changing .\PyTools\fixeoln_one.doc.txt
    Files matched (converted or not): 1065

    C:\temp\examples>python %X%/PyTools/visitor_fixeoln.py tounix
    ...more deleted...
    .\Extend\Swig\Shadow ...
    .\ ...
    .\EmbExt\Exports ...
    .\EmbExt\Exports\ClassAndMod ...
    .\EmbExt\Regist ...
    .\PyTools ...
    Files matched (converted or not): 1065

If you run this script and the original fixeoln_all.py on the book examples tree, you'll notice that this version visits two fewer matched files. This simply reflects the fact that fixeoln_all also collects and skips over two directory names for its patterns in the find.find result (both called "Output"). In all other ways, this version works the same way even when it could do better; adding a break statement after the convertEndlines call here avoids visiting files that appear redundantly in the original's find results lists.

The second command here takes roughly two-thirds as long as the first to finish on my computer (there are no files to be converted). That's roughly 33 percent faster than the original find.find-based version of this script, but they differ in the amount of output, and benchmarks are usually much subtler than you imagine. Most of the real clock time is likely spent scrolling text in the console, not doing any real directory processing.
Since both are plenty fast for their intended purposes, finer-grained performance figures are left as exercises. The script in Example 7-18 combines the original convertOne function (to rename a single file or directory) with the visitor's tree walker class, to create a directory tree-wide fix for uppercase filenames. Notice that we redefine both file and directory visitation methods here, as we need to rename both. Example 7-18. PP3E\PyTools\visitor_fixnames.py
This version is run like the original find.find-based version, fixnames_all, but visits one more name (the top-level root directory), and there is no initial delay while filenames are collected on a list, because we're using os.path.walk again, not find.find. It's also close to the original os.path.walk version of this script but is based on a class hierarchy, not direct function callbacks:

    C:\temp\examples>python %X%/PyTools/visitor_fixnames.py
    ...more deleted...
    303 => .\__init__.py
    304 => .\__init__.pyc
    305 => .\Ai\ExpertSystem\holmes.tar
    306 => .\Ai\ExpertSystem\TODO
    Convert dir=.\Ai\ExpertSystem file=TODO? (y|Y)
    307 => .\Ai\ExpertSystem\__init__.py
    308 => .\Ai\ExpertSystem\holmes\cnv
    309 => .\Ai\ExpertSystem\holmes\README.1ST
    Convert dir=.\Ai\ExpertSystem\holmes file=README.1ST? (y|Y)
    ...more deleted...
    1353 => .\PyTools\visitor_find.pyc
    1354 => .\PyTools\visitor_find_quiet1.py
    1355 => .\PyTools\fixeoln_one.doc.txt
    Converted 1 files, visited 1474

Both of these fixer scripts work roughly the same way as the originals, but because the directory-walking logic lives in just one file (visitor.py), it needs to be debugged only once. Moreover, improvements in that file will automatically be inherited by every directory-processing tool derived from its classes. Even when coding system-level scripts, reuse and reduced redundancy pay off in the end.

7.5.5. Fixing File Permissions in Trees

Just in case the preceding visitor-client sections weren't quite enough to convince you of the power of code reuse, another piece of evidence surfaced very late in this book project. It turns out that copying files off a CD using Windows drag-and-drop sometimes makes them read only in the copy. That's less than ideal for the book examples distribution if it is obtained on CD; you must copy the directory tree onto your hard drive to be able to experiment with program changes (naturally, files on CD can't be changed in place).
But if you copy with drag-and-drop, you may wind up with a tree of more than 1,000 read-only files.
Since drag-and-drop is perhaps the most common way to copy off a CD on Windows, I needed a portable and easy-to-use way to undo the read-only setting. Asking readers to make all of these writable by hand would be impolite, to say the least. Writing a full-blown install system seemed like overkill. Providing different fixes for different platforms doubles or triples the complexity of the task. Much better, the Python script in Example 7-19 can be run in the root of the copied examples directory to repair the damage of a read-only drag-and-drop operation. It specializes the traversal implemented by the FileVisitor class again, this time to run an os.chmod call on every file and directory visited along the way. Example 7-19. PP3E\PyTools\fixreadonly-all.py
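In place of the listing, here is a minimal sketch of the same idea in Python 3: walk the tree and call os.chmod on everything found. It uses os.walk directly rather than the FileVisitor class, skips the confirmation prompt, and the make_writable name and its return value are assumptions for illustration. Note that Python 3 spells the octal mode 0o777.

```python
import os

def make_writable(startdir=os.curdir, mode=0o777):
    """chmod the root plus every directory and file below it; return the count."""
    os.chmod(startdir, mode)
    count = 1
    for dirpath, subdirs, files in os.walk(startdir):
        # Top-down walk: subdirectories are changed here, before os.walk descends
        for name in subdirs + files:
            os.chmod(os.path.join(dirpath, name), mode)
            count += 1
    return count
```

On Windows, os.chmod can only set or clear the read-only flag, which happens to be exactly what a read-only copy from CD needs.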
As we saw in Chapter 3, the built-in os.chmod call changes the permission settings on an external file (here, to 0777: global read, write, and execute permissions). Because os.chmod and the FileVisitor's operations are portable, this same script will work to set permissions in an entire tree on both Windows and Unix-like platforms. Notice that it asks whether you really want to proceed when it first starts up, just in case someone accidentally clicks the file's name in an explorer GUI. Also note that Python must be installed before this script can be run in order to make files writable; that seems a fair assumption to make about users who are about to change Python scripts.

    C:\temp\examples>python PyTools\fixreadonly-all.py
    This script makes all files writeable; continue?y
    . ...
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    ...more deleted...
    1352 => .\PyTools\visitor_find.pyc
    1353 => .\PyTools\visitor_find_quiet1.py
    1354 => .\PyTools\fixeoln_one.doc.txt
    Visited 1354 files and 119 dirs

7.5.6. Changing Unix Executable Path Lines

Finally, the following script does something more specialized: it uses the visitor classes to replace the "#!" lines at the top of all scripts in a directory tree (this line gives the path to the Python interpreter on Unix-like machines). It's easy to do this with the visitor_replace script of Example 7-13 that we coded earlier. For example, say something like this to replace all #!/usr/bin/python lines with #!\Python24\python:

    C:\...\PP3E>python PyTools\visitor_replace.py #!/usr/bin/python #!\Python24\python

Lots of status messages scroll by unless redirected to a file. visitor_replace does a simple global search-and-replace operation on all nonbinary files in an entire directory tree. It's also a bit naïve: it won't change other "#!" line patterns that mention python (e.g., you'll have to run it again to change #!/usr/local/bin/python), and it might change occurrences besides those on a first line.
That probably won't matter, but if it does, it's easy to write your own visitor subclass to be more accurate. When run, the script in Example 7-20 converts all "#!" lines in all script files in an entire tree. It changes every first line that starts with "#!" and names "python" to a line you pass in on the command line or assign in the script, like this:

    C:\...\PP3E>python PyTools\visitor_poundbang.py #!\MyPython24\python
    Are you sure?y
    . ...
    1 => .\__init__.py
    2 => .\PyDemos2.pyw
    3 => .\towriteable.py
    ...
    1474 => .\Integrate\Mixed\Exports\ClassAndMod\output.prog1
    1475 => .\Integrate\Mixed\Exports\ClassAndMod\setup-class.csh
    Visited 1475 files and 133 dirs, changed 190 files
    .\towriteable.py
    .\Launch_PyGadgets.py
    .\Launch_PyDemos.pyw
    ...

    C:\...\PP3E>type .\Launch_PyGadgets.py
    #!\MyPython24\python
    ###############################################
    # PyGadgets + environment search/config first
    ...

This script caught and changed 190 files (more than visitor_replace), so there must be other "#!" line patterns lurking in the examples tree besides #!/usr/bin/python.

Example 7-20. PP3E\PyTools\visitor_poundbang.py
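The first-line rule just described (starts with "#!" and names "python") can be sketched like this in Python 3. The function names are hypothetical and os.walk stands in for the visitor classes, so treat it as an illustration of the rule rather than the original listing.

```python
import os

def fix_pound_bang(path, new_line):
    """Rewrite path's first line if it starts with '#!' and names python.

    Returns True only if the file was actually changed.
    """
    try:
        # newline='' leaves each line's original ending untouched
        with open(path, encoding='utf-8', newline='') as f:
            lines = f.readlines()
    except (OSError, UnicodeDecodeError):
        return False                      # unreadable or binary: skip
    if not lines:
        return False
    first = lines[0]
    body = first.rstrip('\r\n')
    if not (body.startswith('#!') and 'python' in body) or body == new_line:
        return False
    lines[0] = new_line + first[len(body):]   # reuse the old line ending
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.writelines(lines)
    return True


def fix_tree(startdir, new_line):
    """Apply fix_pound_bang to every file below startdir; list changed paths."""
    changed = []
    for dirpath, subdirs, files in os.walk(startdir):
        for name in files:
            path = os.path.join(dirpath, name)
            if fix_pound_bang(path, new_line):
                changed.append(path)
    return changed
```

Because only the first line is inspected, a "#!" string that appears later in a file is left alone, which is the accuracy gain over a plain global replacement.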
7.5.7. Summary: Counting Source Lines Four Ways

We've seen a few techniques for scanning directory trees in this book so far. To summarize and contrast, this section briefly lists four scripts that count the number of lines in all program source files in an entire tree. Each script uses a different directory traversal scheme, but returns the same result. I counted 41,938 source lines of code (SLOC) in the book examples distribution with these scripts, as of November 2001 (for the second edition of this book). Study these scripts' code for more details. They don't count everything (e.g., they skip makefiles), but are comprehensive enough for ballpark figures.

Here's the output for the visitor class version when run on the root of the book examples tree; the root of the tree to walk is passed in as a command-line argument, and the last output line is a dictionary that keeps counts for the specific file-type extensions in the tree:

    C:\temp>python wcall_visitor.py %X%
    ...lines deleted...
    C:\PP2ndEd\examples\PP3E\Integrate\Mixed\Exports\ClassAndMod\cinterface.py
    C:\PP2ndEd\examples\PP3E\Integrate\Mixed\Exports\ClassAndMod\main-table.c
    Visited 1478 files and 133 dirs
    --------------------------------------------------------------------------------
    Files=> 903 Lines=> 41938
    {'.c': 46, '.cgi': 24, '.html': 41, '.pyw': 11, '.cxx': 2, '.py': 768, '.i': 3, '.h': 8}

The first version, listed in Example 7-21, counts lines using the standard library's os.path.walk call, which we met in Chapter 4 (using os.walk would be similar, but we would replace the callback function with a for loop, and subdirectories and files would be segregated into two lists of names).

Example 7-21. PP3E\PyTools\wcall.py
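As a sketch of the os.walk variant mentioned in the parenthetical above (Python 3, not the original listing), the counting loop might look like this. The extension list is taken from the dictionary shown in the output above; the count_lines name and its return shape are assumptions.

```python
import os

# Extensions as they appear in the output dictionary shown earlier
exts = ['.py', '.pyw', '.cgi', '.html', '.c', '.cxx', '.h', '.i']

def count_lines(startdir):
    """Return (file count, line count, per-extension file counts)."""
    fcount = lcount = 0
    per_ext = dict.fromkeys(exts, 0)
    for dirpath, subdirs, files in os.walk(startdir):
        for name in files:               # files only: directory names are separate
            ext = os.path.splitext(name)[1]
            if ext in per_ext:
                with open(os.path.join(dirpath, name), 'rb') as f:
                    lcount += sum(1 for line in f)
                fcount += 1
                per_ext[ext] += 1
    return fcount, lcount, per_ext
```

Since os.walk hands back files and subdirectories in separate lists, a directory whose name happens to end in a source-like extension is never tallied here.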
Counting with the find module we wrote at the end of Chapter 4 with Example 7-22 is noticeably simpler, though we must wait for the list of files to be collected. Example 7-22. PP3E\PyTools\wcall_find.py
The prior script collected all source files in the tree with find and manually checked their extensions; the next script (Example 7-23) uses the pattern-matching capability in find to collect only source files in the result list. Example 7-23. PP3E\PyTools\wcall_find_patt.py
And finally, Example 7-24 is the SLOC counting logic refactored to use the visitor class framework we wrote in this chapter; OOP adds a bit more code here, but this version is more accurate (if a directory name happens to have a source-like extension, the prior versions will incorrectly tally it). More importantly, by using OOP:
Even in the systems tools domains, strategic thinking can pay off eventually. Example 7-24. PP3E\PyTools\wcall_visitor.py