Walking a Directory Tree | Files and Directories

Problem

You want to recursively process every subdirectory and file within a certain directory.

Solution

Suppose that the directory tree you want to walk looks like this (see this chapter's introduction section for the create_tree library that can build this directory tree automatically):

	require 'create_tree'
	create_tree './' =>
	  [ 'file1',
	    'file2',
	    { 'subdir1/' => [ 'file1' ] },
	    { 'subdir2/' => [ 'file1',
	                      'file2',
	                      { 'subsubdir/' => [ 'file1' ] }
	                    ]
	    }
	  ]

The simplest solution is to load all the files and directories into memory with a big recursive file glob and iterate over the resulting array. For a large tree this can use a lot of memory, because every filename is loaded at once:

	Dir['**/**']
	# => ["file1", "file2", "subdir1", "subdir2", "subdir1/file1",
	# "subdir2/file1", "subdir2/file2", "subdir2/subsubdir",
	# "subdir2/subsubdir/file1"]
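
As a minimal sketch of iterating over the glob result, the snippet below keeps only the regular files. The files_under helper and the throwaway tree it runs against are assumptions made for illustration; the pattern '**/*' is used here in place of '**/**'.

```ruby
require 'tmpdir'
require 'fileutils'

# Returns the regular files beneath root, as paths relative to root.
# (files_under is a hypothetical helper name, not part of Ruby.)
def files_under(root)
  Dir.chdir(root) do
    Dir['**/*'].select { |path| File.file?(path) }.sort
  end
end

# Build a small throwaway tree so the example is self-contained.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, 'subdir1'))
FileUtils.touch(File.join(root, 'file1'))
FileUtils.touch(File.join(root, 'subdir1', 'file1'))

found = files_under(root)
puts found
# file1
# subdir1/file1

FileUtils.remove_entry(root)
```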

A more elegant solution is to use the find method in the Find module. It performs a depth-first traversal of a directory tree and calls the given code block on each directory and file. The code block takes one argument: the path to a directory or file, beginning with the starting directory you passed to find.

This snippet calls Find.find with a code block that simply prints out each path it receives. This demonstrates how Ruby performs the traversal:

	require 'find'

	Find.find('./') { |path| puts path }
	# ./
	# ./subdir2
	# ./subdir2/subsubdir
	# ./subdir2/subsubdir/file1
	# ./subdir2/file2
	# ./subdir2/file1
	# ./subdir1
	# ./subdir1/file1
	# ./file2
	# ./file1
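
The block can do more than print. As a rough sketch, the snippet below uses Find.find to add up the sizes of the regular files in a tree. The total_size helper and the temporary tree are assumptions made for illustration.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Sums the sizes, in bytes, of all regular files beneath root.
# (total_size is a hypothetical helper name.)
def total_size(root)
  total = 0
  Find.find(root) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end

# Build a small throwaway tree holding 3 + 5 bytes of data.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, 'subdir1'))
File.write(File.join(root, 'file1'), 'abc')
File.write(File.join(root, 'subdir1', 'file1'), 'defgh')

size = total_size(root)
puts size
# 8

FileUtils.remove_entry(root)
```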

Discussion

Even if you're not a system administrator, the demands of keeping your own files organized will frequently call for you to process every file in a directory tree. You may want to back up, modify, or delete each file in the directory structure, or you may just want to see what's there.

Normally you'll want to at least look at every file in the tree, but sometimes you'll want to skip certain directories. For instance, you might know that a certain directory is full of a lot of large files you don't want to process. When your block is passed a path to a directory, you can prevent Find.find from recursing into a directory by calling Find.prune. In this example, I'll prevent Find.find from processing the files in the subdir2 directory.

	Find.find('./') do |path|
	  Find.prune if File.basename(path) == 'subdir2'
	  puts path
	end
	# ./
	# ./subdir1
	# ./subdir1/file1
	# ./file2
	# ./file1
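
The same idea can be wrapped in a reusable method. The sketch below prunes any directory whose name appears in a list; paths_without, skip_names, and the keep and cache directory names are all hypothetical, chosen only for this illustration.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Collects every path under root, pruning any directory whose basename
# appears in skip_names. (paths_without is a hypothetical helper name.)
def paths_without(root, skip_names)
  paths = []
  Find.find(root) do |path|
    if File.directory?(path) && skip_names.include?(File.basename(path))
      Find.prune # leaves the block here, so the directory is never recorded
    end
    paths << path
  end
  paths
end

# Build a small throwaway tree: keep/file1 and cache/big.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, 'keep'))
FileUtils.mkdir_p(File.join(root, 'cache'))
FileUtils.touch(File.join(root, 'keep', 'file1'))
FileUtils.touch(File.join(root, 'cache', 'big'))

# Show the paths relative to the starting directory, in sorted order.
kept = paths_without(root, ['cache']).map { |path| path.sub(root, '.') }.sort
p kept
# [".", "./keep", "./keep/file1"]

FileUtils.remove_entry(root)
```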

Calling Find.prune when your block has been passed a file will only prevent Find.find from processing that one file. It won't halt the processing of the rest of the files in that directory:

	Find.find('./') do |path|
	  if File.basename(path) =~ /file2$/
	    puts "PRUNED #{path}"
	    Find.prune
	  end
	  puts path
	end
	# ./
	# ./subdir2
	# ./subdir2/subsubdir
	# ./subdir2/subsubdir/file1
	# PRUNED ./subdir2/file2
	# ./subdir2/file1
	# ./subdir1
	# ./subdir1/file1
	# PRUNED ./file2
	# ./file1

Find.find works by keeping a queue of files to process. When it finds a directory, it inserts that directory's files at the beginning of the queue. This gives it the characteristics of a depth-first traversal. Note how all the files in the top-level directory are processed after the subdirectories. The alternative would be a breadth-first traversal, which would process the files in a directory before even touching the subdirectories.
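
The queue mechanism described above can be sketched by hand. This is a simplified illustration, not the actual source of Find.find: the walk method and its order parameter are hypothetical, and the sketch sorts each directory's entries so that the result is predictable.

```ruby
require 'tmpdir'
require 'fileutils'

# Walks a tree using an explicit list of pending paths.
# order is :depth (new entries go to the front of the list) or
# :breadth (new entries go to the back). walk is a hypothetical name.
def walk(root, order = :depth)
  pending = [root]
  visited = []
  until pending.empty?
    path = pending.shift
    visited << path
    # Only real directories are expanded; symlinks are not followed.
    next unless File.directory?(path) && !File.symlink?(path)
    names = (Dir.entries(path) - ['.', '..']).sort
    children = names.map { |name| File.join(path, name) }
    if order == :depth
      pending.unshift(*children) # front of the list: depth-first
    else
      pending.push(*children)    # back of the list: breadth-first
    end
  end
  visited
end

# Build a small throwaway tree: d/x and z.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, 'd'))
FileUtils.touch(File.join(root, 'd', 'x'))
FileUtils.touch(File.join(root, 'z'))

depth   = walk(root, :depth).map   { |path| path.sub(root, '.') }
breadth = walk(root, :breadth).map { |path| path.sub(root, '.') }
p depth
# [".", "./d", "./d/x", "./z"]
p breadth
# [".", "./d", "./z", "./d/x"]

FileUtils.remove_entry(root)
```

The only difference between the two orders is which end of the list receives a directory's entries.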

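If you want something closer to a breadth-first traversal, the simplest solution is to use a glob and sort the resulting array. In this tree the filenames sort ahead of the subdirectory names, so each directory's files are listed before its subdirectories are entered. This is an approximation rather than a strict level-by-level traversal, and it depends on how your names sort: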

	Dir["**/**"].sort.each { |x| puts x }
	# file1
	# file2
	# subdir1
	# subdir1/file1
	# subdir2
	# subdir2/file1
	# subdir2/file2
	# subdir2/subsubdir
	# subdir2/subsubdir/file1
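
If a strict level-by-level order matters, one option is to sort on the depth of each path first and on the name second. The sketch below assumes that counting '/' separators is a good enough measure of depth, and it works on a literal copy of the listing above rather than on a live directory.

```ruby
# The paths from the sorted glob above, copied in as a literal array.
paths = [
  'file1', 'file2', 'subdir1', 'subdir1/file1', 'subdir2',
  'subdir2/file1', 'subdir2/file2', 'subdir2/subsubdir',
  'subdir2/subsubdir/file1'
]

# Sort by depth (number of separators), then alphabetically within a level.
by_level = paths.sort_by { |path| [path.count('/'), path] }
puts by_level
# file1
# file2
# subdir1
# subdir2
# subdir1/file1
# subdir2/file1
# subdir2/file2
# subdir2/subsubdir
# subdir2/subsubdir/file1
```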