7.8. Comparing Directory Trees
Engineers can be a
Because data CDs are accessible as simple directory trees, we are once again in the realm of tree walkersto verify a backup CD, we simply need to walk its top-level directory. If our script is general enough, we will also be able to use it to verify other copy operations as welle.g., downloaded tar files, hard-drive
We've already written a generic walker class (the visitor module), but it won't help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won't allow us to spot files and directories that exist only in the other. Something more custom seems in order here. 7.8.1. Finding Directory Differences
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem
More
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested and simpler operation, let's start by coding and debugging a single-directory comparison of filenames in Example 7-29. Example 7-29. PP3E\System\Filetools\dirdiff.py
Given listings of names in two directories, this script simply picks out unique names in the first and unique names in the second, and
Let's run this script on a few directories; differences are
C:\temp> python %X%\system\filetools\dirdiff.py examples cpexamples Comparing examples to cpexamples Directory lists are identical C:\temp> python %X%\system\filetools\dirdiff.py examples\PyTools cpexamples\PyTools Comparing examples\PyTools to cpexamples\PyTools Files unique to examples\PyTools ... visitor.py C:\temp> python %X%\system\filetools\dirdiff.py examples\System\Filetools cpexamples\System\Filetools Comparing examples\System\Filetools to cpexamples\System\Filetools Files unique to examples\System\Filetools ... dirdiff2.py Files unique to cpexamples\System\Filetools ... cpall.py
The
unique
function is the heart of this script: it
>>> L1 = [1, 3, 5, 7, 9] >>> L2 = [2, 3, 6, 8, 9] >>> from dirdiff import unique >>> unique(L1, L2) # items in L1 but not in L2 [1, 5, 7] >>> unique(L2, L1) # items in L2 but not in L1 [2, 6, 8] These two lists have objects 3 and 9 in common; the rest appear only in one of the two. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversals. There are other ways to code this operation; see the dirdiff variants on the book's examples distribution for a few of them. In fact, in Python 2.4 and later, we could also use the built-in set object type if we don't care about the order in the results (we'll use our own functions for now to avoid requiring users to upgrade): >>> S1 = set([1, 3, 5, 7, 9]) >>> S2 = set([2, 3, 6, 8, 9]) >>> S1 - S2 # difference: unique set([1, 5, 7]) >>> S2 - S1 set([8, 2, 6]) >>> S1 & S2 # intersection: common set([9, 3]) 7.8.2. Finding Tree Differences
Now all we need is a tree walker that applies
dirdiff
at each level to pick out unique files and directories, explicitly
Example 7-30. PP3E\System\Filetools\diffall.py
At each directory in the tree, this script simply runs the
dirdiff
tool to detect unique names, and then compares names in common by intersecting directory lists. It uses recursive function calls to traverse the tree and
Notice the
misses
list, added in the third edition of this book; it's very
bytes1 = open(path1, 'rb').read( )
bytes2 = open(path2, 'rb').read( )
if bytes1 == bytes2:
...match...
else:
...difference...
This code was simpler, but is less practical for very large files that can't fit into your available memory space (consider CD and DVD image files, for example). In the new version's loop, the file reads return what is left when there is less than 1 MB present or remaining and return empty strings at end-of-file. Files match if all blocks read are the same, and they reach end-of-file at the same time. On some platforms, you may also want to detect and skip certain kinds of special files in order to be fully general, but these were not in my trees, so they are not in my script. 7.8.3. Running the Script
Since we've already studied the
C:\temp> python %X%\system\filetools\diffall.py examples cpexamples -------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents -------------------- Comparing examples\old_Part2 to cpexamples\old_Part2 Directory lists are identical Comparing contents -------------------- ...more lines deleted... -------------------- Comparing examples\EmbExt\Regist to cpexamples\EmbExt\Regist Directory lists are identical Comparing contents -------------------- Comparing examples\PyTools to cpexamples\PyTools Directory lists are identical Comparing contents ======================================== No diffs found.
I run this with the
verbose
flag passed in as
False
; use
TRue
to watch more status messages fly by. To show how differences are reported, we need to generate a few. Let's run the global
C:\temp\examples> python %X%\PyTools\visitor_replace.py -exec SPAM Are you sure? y ... 1355 => .\PyTools\visitor_find_quiet1.py 1356 => .\PyTools\fixeoln_one.doc.txt Visited 1356 files Changed 8 files: .\package.csh .\README-PP3E.txt .\readme-old-pp1E.txt .\temp .\remp .\Internet\Cgi-Web\fixcgi.py .\System\Processes\output.txt .\PyTools\cleanpyc.py While we're at it, let's remove a few common files so that directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees): C:\temp> rm cpexamples\PyTools\visitor.py C:\temp> rm cpexamples\System\Filetools\dirdiff2.py C:\temp> rm examples\System\Filetools\cpall.py
Now, rerun the comparison walker to pick out differences and redirect its output report to a file for easy inspection. The following lists just the
C:\temp>
python %X%\system\filetools\diffall.py examples cpexamples > diffs
C:\temp>
type diffs
--------------------
Comparing examples to cpexamples
Directory lists are identical
Comparing contents
package.csh DIFFERS
README-PP3E.txt DIFFERS
readme-old-pp1E.txt DIFFERS
temp DIFFERS
remp DIFFERS
--------------------
Comparing examples\old_Part2 to cpexamples\old_Part2
Directory lists are identical
Comparing contents
--------------------
...
--------------------
Comparing examples\Internet\Cgi-Web to cpexamples\Internet\Cgi-Web
Directory lists are identical
Comparing contents
fixcgi.py DIFFERS
--------------------
...
--------------------
Comparing examples\System\Filetools to cpexamples\System\Filetools
Files unique to examples\System\Filetools
... dirdiff2.py
Files unique to cpexamples\System\Filetools
... cpall.py
Comparing contents
--------------------
...
--------------------
Comparing examples\System\Processes to cpexamples\System\Processes
Directory lists are identical
Comparing contents
output.txt DIFFERS
--------------------
...
--------------------
Comparing examples\PyTools to cpexamples\PyTools
Files unique to examples\PyTools
... visitor.py
Comparing contents
cleanpyc.py DIFFERS
========================================
Diffs found: 10
- files differ at examples\package.csh - cpexamples\package.csh
- files differ at examples\README-PP3E.txt - cpexamples\README-PP3E.txt
- files differ at examples\readme-old-pp1E.txt - cpexamples\readme-old-pp1E.txt
- files differ at examples\temp - cpexamples\temp
- files differ at examples\remp - cpexamples\remp
- files differ at examples\Internet\Cgi-Web\fixcgi.py -
cpexamples\Internet\Cgi-Web\fixcgi.py
- unique files at examples\System\Filetools -
cpexamples\System\Filetools
- files differ at examples\System\Processes\output.txt -
cpexamples\System\Processes\output.txt
- unique files at examples\PyTools - cpexamples\PyTools
- files differ at examples\PyTools\cleanpyc.py - cpexamples\PyTools\cleanpyc.py
I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. Ten differences were foundthe eight files we changed (trashed) with the replacement script, and the two directories we
7.8.3.1. Verifying CD backups
So how does this script placate CD backup
C:\temp> python %X%\system\filetools\diffall.py examples g:\PP3rdEd\examples\PP3E > exdiffs091500 C:\temp> more exdiffs091500 -------------------- Comparing examples to g:\PP3rdEd\examples\PP3E Files unique to examples ... .cshrc Comparing contents tounix.py DIFFERS -------------------- Comparing examples\old_Part2 to g:\PP3rdEd\examples\PP3E\old_Part2 Directory lists are identical Comparing contents -------------------- ...more visitor_fixeoln.py DIFFERS visitor_fixnames.py DIFFERS ======================================== Diffs found: 41 - unique files at examples - g:\PP3rdEd\examples\PP3E - files differ at examples\tounix.py - g:\PP3rdEd\examples\PP3E\tounix.py ...more
The CD
C:\temp> python %X%\System\Filetools\diffall.py examples g:\PP3rdEd\examples\PP3E ... -------------------- Comparing examples\Config to g:\PP3rdEd\examples\PP3E\Config Files unique to examples\Config ... .cshrc Comparing contents ======================================== Diffs found: 1 - unique files at examples\Config - g:\PP3rdEd\examples\PP3E\Config And to be really sure, I run the following global comparison command against the true book directory to verify the entire book tree backup on CD: C:\> python %X%\System\Filetools\diffall.py PP3rdEd G:\PP3rdEd -------------------- Comparing PP3rdEd to G:\PP3rdEd Files unique to G:\PP3rdEd ... examples.tar.gz Comparing contents README.txt DIFFERS -------------------- ...more -------------------- Comparing PP3rdEd\examples\PP3E\Config to G:\PP3rdEd\examples\PP3E\Config Files unique to PP3rdEd\examples\PP3E\Config ... .cshrc Comparing contents -------------------- ...more -------------------- Comparing PP3rdEd\chapters to G:\PP3rdEd\chapters Directory lists are identical Comparing contents ch01-intro.doc DIFFERS ch04-os3.doc DIFFERS ch05-gui1.doc DIFFERS ch06-gui2.doc DIFFERS -------------------- ...more ======================================== Diffs found: 11 - unique files at PP3rdEd - G:\PP3rdEd - files differ at PP3rdEd\README.txt - G:\PP3rdEd\README.txt ...more
This particular run indicates that I've changed a README file, four chapter files, and a bunch more since the last backup; if run immediately after making a backup, only the
.cshrc
unique file shows up on
diffall
radar. This global comparison can take a few minutes. It performs byte-for-byte comparisons of all chapter files and screenshots, the examples tree, and more, but it's an accurate and complete verification. Given that this book tree contained
After writing this script, I also started using it to verify full backups of my laptops onto an external hard-drive device. To do so, I run the cpall copy script we met earlier, and then the comparison script here to check results and get a list of files that didn't copy correctly. The last time I did this, this procedure copied and compared 225,000 files and 15,000 directories in 20 GB of spacenot the sort of task that lends itself to manual labor!
Here are the magic incantations on my Windows laptop.
f:\
is a partition on my external hard drive, and you shouldn't be surprised if each of these commands runs for half an
C:\...\System\Filetools> cpall.py c:\ f:\ > f:\copy-log.txt C:\...\System\Filetools> diffall.py f:\ c:\ > f:\diff-log.txt 7.8.4. Reporting DifferencesFinally, it's worth noting that this script still only detects differences in the tree but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with string comparisons. It's a simple yes/no result. If and when I need more details about how two reported files actually differ, I either edit the files or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That's not a portable solution for this last step; but for my purposes, just finding the differences in a 1,300-file tree was much more critical than reporting which lines differ in files flagged in the report.
Of course, since we can always run shell commands in Python, this last step could be automated by spawning a
diff
or
fc
command with
os.popen
as differences are
We also might try to do a bit better here by opening text files in text mode to ignore line-terminator differences, but it's not clear that such differences should be ignored (what if the caller wants to know whether line-end markers have been changed?). For example, after downloading a web site with an FTP script we'll meet in Chapter 14, the diffall script detected a discrepancy between the local copy of a file and the one at the remote server. To probe further, I simply ran some interactive Python code:
>>>
a = open('lp2e-updates.html', 'rb').read( )
>>>
b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read( )
>>>
a == b
False
This verifies that there really is a binary difference in the downloaded and local versions of the file; to see whether it's because a Unix or DOS line-end snuck into the file, try again in text mode so that line ends are all mapped to the standard \n character:
>>>
a = open('lp2e-updates.html', 'r').read( )
>>>
b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read( )
>>>
a == b
True
Sure enough; now, to find where the difference is, the following code checks character by character until the first mismatch is found (in binary mode, so we retain the difference):
>>>
a = open('lp2e-updates.html', 'rb').read( )
>>>
b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read( )
>>>
for (i, (ac, bc)) in enumerate(zip(a, b)):
...
if ac != bc:
...
print i, repr(ac), repr(bc)
...
break
...
37966 '\r' '\n'
This means that at byte offset 37,966, there is a \r in the downloaded file, but a \n in the local copy. This line has a DOS line end in one and a Unix line end in the other. To see more, print text around the mismatch: >>> for (i, (ac, bc)) in enumerate(zip(a, b)): ... if ac != bc: ... print i, repr(ac), repr(bc) ... print repr(a[i-20:i+20]) ... print repr(b[i-20:i+20]) ... break ... 37966 '\r' '\n' 're>\r\ndef min(*args):\r\n tmp = list(arg' 're>\r\ndef min(*args):\n tmp = list(args'
Apparently, I wound up with a Unix line end at one point in the local copy and a DOS line end in the version I downloadedthe combined effect of the text mode used by the download script itself (which translated
\n
to
\r\n
) and
Because Python excels at processing files and strings, it's even possible to go one step further and code a Python equivalent of the fc and diff commands. In fact, much of the work has already been done; the standard library module difflib , new as of Python 2.1, could make this task simple. See the Python library manual for details and usage examples.
We could also be
Since such extensions are beyond both this script's scope and this chapter's size limits, though, they will have to await the attention of a
|