7.4. Searching Directory Trees

Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I'd stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things such as program directory paths and module names were hardcoded all over the place: in package import statements, program startup calls, text notes, configuration files, and more.

One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That's so tedious as to be utterly impossible in this book's examples tree, though; as I wrote these words, the examples tree contained 118 directories and 1,342 files! (To count for yourself, run the command line python PyTools/visitor.py 1 in the PP3E examples root directory.) Clearly, I needed a way to automate updates after changes.
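If you don't have the visitor.py script handy, a rough count can be computed with a few lines of standard library code. This is only a minimal sketch, not the visitor.py script itself; it assumes Python 2.3 or later for os.walk:

    import os

    fcount = dcount = 0
    for (dirpath, dirnames, filenames) in os.walk(os.curdir):   # walk the tree rooted here
        dcount += 1                                              # one visit per subdirectory
        fcount += len(filenames)                                 # plus its nondirectory files
    print '%d files, %d directories' % (fcount, dcount)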

7.4.1. Greps and Globs in Shells and Python

There is a standard way to search files for strings on Unix and Linux systems: the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[*] Given that Unix shells expand (i.e., "glob") filename patterns automatically, a command such as grep popen *.py will search a single directory's Python files for the string "popen". Here's such a command in action on Windows (I installed a commercial Unix-like fgrep program on my Windows laptop because I missed it too much there):

[*] In fact, the act of searching files often goes by the colloquial name "grepping" among developers who have spent any substantial time in the Unix ghetto.

    C:\...\PP3E\System\Filetools>fgrep popen *.py
    diffall.py:# - we could also os.popen a diff (unix) or fc (dos)
    dirdiff.py:# - use os.popen('ls...') or glob.glob + os.path.split
    dirdiff6.py:    files1 = os.popen('ls %s' % dir1).readlines( )
    dirdiff6.py:    files2 = os.popen('ls %s' % dir2).readlines( )
    testdirdiff.py:    expected = expected + os.popen(test % 'dirdiff').read( )
    testdirdiff.py:        output = output + os.popen(test % script).read( )

DOS has a command for searching files too: find, not to be confused with the Unix find directory walker command:

    C:\...\PP3E\System\Filetools>find /N "popen" testdirdiff.py

    ---------- testdirdiff.py
    [8]    expected = expected + os.popen(test % 'dirdiff').read( )
    [15]        output = output + os.popen(test % script).read( )

You can do the same within a Python script by running the previously mentioned shell commands with os.system or os.popen. Until recently, this could also be done by combining the glob module with the now-defunct grep module in the standard library. We met glob in Chapter 4; it expands a filename pattern into a list of matching filename strings (much like a Unix shell). In the past, the standard library also included a grep module, which acted like a Unix grep command: grep.grep printed lines containing a pattern string among a set of files. When used with glob, the effect was much like that of the fgrep command:

    >>> from grep import grep
    >>> from glob import glob
    >>> grep('popen', glob('*.py'))
    diffall.py:  16: # - we could also os.popen a diff (unix) or fc (dos)
    dirdiff.py:  12: # - use os.popen('ls...') or glob.glob + os.path.split
    dirdiff6.py:  19:     files1 = os.popen('ls %s' % dir1).readlines( )
    dirdiff6.py:  20:     files2 = os.popen('ls %s' % dir2).readlines( )
    testdirdiff.py:   8:     expected = expected + os.popen(test % 'dirdiff')...
    testdirdiff.py:  15:         output = output + os.popen(test % script).read( )

    >>> import glob, grep
    >>> grep.grep('system', glob.glob('*.py'))
    dirdiff.py:  16: # - on unix systems we could do something similar by
    regtest.py:  18:         os.system('%s < %s > %s.out 2>&1' % (program, ...
    regtest.py:  23:         os.system('%s < %s > %s.out 2>&1' % (program, ...
    regtest.py:  24:         os.system('diff %s.out %s.out.bkp > %s.diffs' ...
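Because the grep module is gone, you may need to roll your own if you want this style of search from interactive Python today. The following is a minimal sketch of an equivalent, not part of the standard library; the pygrep name is my own invention, and it simply prints matching lines much as grep.grep once did:

    import glob

    def pygrep(pattern, filenames):                       # hypothetical grep.grep stand-in
        for fname in filenames:
            for (num, line) in enumerate(open(fname)):    # scan each line of each file
                if line.find(pattern) != -1:              # simple substring test
                    print '%s: %3d: %s' % (fname, num + 1, line.rstrip())

    pygrep('popen', glob.glob('*.py'))                    # search this directory's .py files

Like the original module, this only prints its results; the tree searcher developed later in this section is more general.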

Unfortunately, the grep module, much like the original find module discussed at the end of Chapter 4, has been removed from the standard library in the time since I wrote this example for the second edition of this book (it was limited to printing results, and so is less general than other tools). On Unix systems, we can work around its demise by running a grep shell command from within a find shell command. For instance, the following Unix command line:

 find . -name "*.py" -print -exec fgrep popen {} \; 

would pinpoint lines and files at and below the current directory that mention popen. If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.
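As suggested earlier, a Python script can also run this combined command itself with os.popen and postprocess its output. Here is a minimal sketch, assuming a Unix-like shell where both find and fgrep are available on the search path:

    import os

    cmd = r'find . -name "*.py" -exec fgrep popen {} \;'   # the shell command shown above
    for line in os.popen(cmd).readlines():                 # collect the matching lines
        print line[:-1]                                     # strip the trailing newline

Of course, this inherits the portability constraints of the underlying shell tools, a point we'll return to with Example 7-8.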

7.4.1.1. Cleaning up bytecode files

For instance, I used to run the script in Example 7-8 on some of my machines to remove all .pyc bytecode files in the examples tree before packaging it or upgrading Python (old bytecode files may not be forward compatible with newer Python releases).

Example 7-8. PP3E\PyTools\cleanpyc.py

    #########################################################################
    # find and delete all "*.pyc" bytecode files at and below the directory
    # where this script is run; this assumes a Unix-like find command, and
    # so is very nonportable; we could instead use the Python find module,
    # or just walk the directory trees with portable Python code; the find
    # -exec option can apply a Python script to each file too;
    #########################################################################

    import os, sys

    if sys.platform[:3] == 'win':
        findcmd = r'c:\stuff\bin.mks\find . -name "*.pyc" -print'
    else:
        findcmd = 'find . -name "*.pyc" -print'
    print findcmd

    count = 0
    for file in os.popen(findcmd).readlines( ):        # for all filenames
        count += 1                                     # have \n at the end
        print str(file[:-1])
        os.remove(file[:-1])

    print 'Removed %d .pyc files' % count

This script uses os.popen to collect the output of a commercial package's find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It's also completely nonportable to Windows machines that don't have the commercial Unix-like find program installed, and that includes other computers in my house, not to mention those throughout most of the world at large.

Python scripts can reuse underlying shell tools with os.popen, but by so doing they lose much of the portability advantage of the Python language. The Unix find command is not universally available and is a complex tool by itself (in fact, too complex to cover in this book; see a Unix manpage for more details). As we saw in Chapter 4, spawning a shell command also incurs a performance hit, because it must start a new independent program on your computer.

To avoid some of the portability and performance costs of spawning an underlying find command, I eventually recoded this script to use the find utilities we met and wrote in Chapter 4. The new script is shown in Example 7-9.

Example 7-9. PP3E\PyTools\cleanpyc-py.py

    ##########################################################################
    # find and delete all "*.pyc" bytecode files at and below the directory
    # where this script is run; this uses a Python find call, and so is
    # portable to most machines; run this to delete .pyc's from an old Python
    # release; cd to the directory you want to clean before running;
    ##########################################################################

    import os, sys, find              # here, gets PyTools find

    count = 0
    for file in find.find("*.pyc"):   # for all filenames
        count += 1
        print file
        os.remove(file)

    print 'Removed %d .pyc files' % count

This works portably, and it avoids external program startup costs. But find is really just a tree searcher that doesn't let you hook into the tree search; if you need to do something unique while traversing a directory tree, you may be better off using a more manual approach. Moreover, find must collect all names before it returns; in very large directory trees, this may introduce significant performance and memory penalties. It's not an issue for my trees, but it could be for yours.
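If collecting all names up front is a concern, one alternative is a generator that yields matching names as the walk proceeds. The sketch below is not the Chapter 4 find module; it assumes Python 2.3 or later for os.walk, and the findgen name is my own:

    import os, fnmatch

    def findgen(pattern, startdir=os.curdir):                  # hypothetical lazy find variant
        for (dirpath, dirnames, filenames) in os.walk(startdir):
            for name in fnmatch.filter(filenames, pattern):    # names matching the pattern
                yield os.path.join(dirpath, name)              # yielded one at a time

    for filepath in findgen('*.pyc'):                          # results appear as they are found,
        print filepath                                         # not after the full traversal

Because results are produced incrementally, a caller can pause, filter, or abandon the traversal early without paying for a full tree walk up front.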

7.4.2. A Python Tree Searcher

To help ease the task of performing global searches on all platforms I might ever use, I coded a Python script to do most of the work for me. Example 7-10 employs the following standard Python tools that we met in the preceding chapters:

  • os.path.walk to visit files in a directory

  • find string method to search for a string in a text read from a file

  • os.path.splitext to skip over files with binary-type extensions

  • os.path.join to portably combine a directory path and filename

  • os.path.isdir to skip paths that refer to directories, not files

Because it's pure Python code, though, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than using os.popen to spawn a find command that in turn spawns many grep commands.

Example 7-10. PP3E\PyTools\search_all.py

    ############################################################################
    # Use: "python ..\..\PyTools\search_all.py string".
    # search all files at and below current directory for a string; uses the
    # os.path.walk interface, rather than doing a find to collect names first;
    ############################################################################

    import os, sys
    listonly = False
    skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']        # ignore binary files

    def visitfile(fname, searchKey):                       # for each non-dir file
        global fcount, vcount                              # search for string
        print vcount+1, '=>', fname                        # skip protected files
        try:
            if not listonly:
                if os.path.splitext(fname)[1] in skipexts:
                    print 'Skipping', fname
                elif open(fname).read( ).find(searchKey) != -1:
                    raw_input('%s has %s' % (fname, searchKey))
                    fcount += 1
        except: pass
        vcount += 1

    def visitor(myData, directoryName, filesInDirectory):  # called for each dir
        for fname in filesInDirectory:                     # do non-dir files here
            fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
            if not os.path.isdir(fpath):                   # myData is searchKey
                visitfile(fpath, myData)

    def searcher(startdir, searchkey):
        global fcount, vcount
        fcount = vcount = 0
        os.path.walk(startdir, visitor, searchkey)

    if __name__ == '__main__':
        searcher('.', sys.argv[1])
        print 'Found in %d files, visited %d' % (fcount, vcount)

This file also uses the sys.argv command-line list and the __name__ trick for running in two modes. When run standalone, the search key is passed on the command line; when imported, clients call this module's searcher function directly. For example, to search (grep) for all appearances of the directory name "Part2" in the examples tree (an old directory that really did go away!), run a command line like this in a DOS or Unix shell:

    C:\...\PP3E>python PyTools\search_all.py Part2
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    .\Launcher.py has Part2
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    8 => .\LaunchBrowser.out.txt
    .\LaunchBrowser.out.txt has Part2
    9 => .\LaunchBrowser.py
    .\LaunchBrowser.py has Part2
    ...
    ...more lines deleted...
    1339 => .\old_Part2\Basics\unpack2b.py
    1340 => .\old_Part2\Basics\unpack3.py
    1341 => .\old_Part2\Basics\__init__.py
    Found in 74 files, visited 1341

The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions listed in the variable skipexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string (bold lines). A solution based on find could not pause this way; although trivial in this example, find doesn't return until the entire tree traversal is finished. The search_all script works the same way when it is imported rather than run, but there is no final statistics output line (fcount and vcount live in the module and so would have to be imported to be inspected here):

    >>> from PP3E.PyTools.search_all import searcher
    >>> searcher('.', '-exec')           # find files with string '-exec'
    1 => .\autoexec.bat
    2 => .\cleanall.csh
    3 => .\echoEnvironment.pyw
    4 => .\Launcher.py
    5 => .\Launcher.pyc
    Skipping .\Launcher.pyc
    6 => .\Launch_PyGadgets.py
    7 => .\Launch_PyDemos.pyw
    8 => .\LaunchBrowser.out.txt
    9 => .\LaunchBrowser.py
    10 => .\Launch_PyGadgets_bar.pyw
    11 => .\makeall.csh
    12 => .\package.csh
    .\package.csh has -exec
    ...more lines deleted...

However launched, this script tracks down all references to a string in an entire directory tree: the name of a changed book examples file, object, or directory, for instance.[*]

[*] See the coverage of regular expressions in Chapter 21. The search_all script here searches for a simple string in each file with the string find method, but it would be trivial to extend it to search for a regular expression pattern match instead (roughly, just replace find with a call to a regular expression object's search method). Of course, such a change will be much easier to make after we've learned how to do it. Also notice the skipexts list in Example 7-10, which attempts to list all possible binary file types: it would be more general and robust to use the mimetypes logic we met at the end of Chapter 6 to guess a file's content type from its name.
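To make the footnote's suggestions concrete, here is a minimal sketch of what the two extensions might look like; the helper names matches and isbinary are mine and are not part of Example 7-10, and the mimetypes heuristic simply treats anything not identified as text as binary:

    import re, mimetypes

    def matches(text, searchKey):                       # pattern-based variant of the simple
        return re.search(searchKey, text) is not None   # string find test in Example 7-10

    def isbinary(fname):                                # guess content type from the filename,
        mimetype = mimetypes.guess_type(fname)[0]       # rather than a fixed skipexts list
        return mimetype is None or not mimetype.startswith('text')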



