Finding Duplicate Files

Problem

You want to find the duplicate files that are taking up all the space on your hard drive.

Solution

The simple solution is to group the files by size and then by their MD5 checksum. Two files are presumed identical if they have the same size and MD5 sum.

The following program takes a list of directories on the command line and prints all sets of duplicate files it finds. You can pass a different code block into each_set_of_duplicates for different behavior: for instance, to prompt the user about which of the duplicates to keep and which to delete (a sketch of this follows the listing).

	#!/usr/bin/ruby
	# find_duplicates.rb

	require 'find'
	require 'digest/md5'

	def each_set_of_duplicates(*paths)
	  # Pass 1: group every file under the given paths by size.
	  sizes = {}
	  Find.find(*paths) do |f|
	    (sizes[File.size(f)] ||= []) << f if File.file? f
	  end
	  sizes.each do |size, files|
	    next unless files.size > 1
	    # Pass 2: group the same-sized files by MD5 sum.
	    md5s = {}
	    files.each do |f|
	      digest = Digest::MD5.hexdigest(File.read(f))
	      (md5s[digest] ||= []) << f
	    end
	    md5s.each { |sum, files| yield files if files.size > 1 }
	  end
	end

	each_set_of_duplicates(*ARGV) do |f|
	  puts "Duplicates: #{f.join(", ")}"
	end
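
For instance, here's one hypothetical replacement for the block at the end of the listing: it keeps the first file in each set and asks before deleting each of the others. The keep-the-first policy and prompt wording are our own illustration, not part of the recipe. Note the explicit $stdin.gets: plain gets would try to read from the files named in ARGV.

	each_set_of_duplicates(*ARGV) do |files|
	  # Arbitrarily treat the first file in each set as the "original".
	  original, *copies = files
	  copies.each do |copy|
	    print "Delete #{copy} (duplicate of #{original})? [y/N] "
	    # gets returns nil at end-of-input; to_s guards against that.
	    File.delete(copy) if $stdin.gets.to_s.strip.downcase == 'y'
	  end
	end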

Discussion

This is one task that can't be handled entirely within a simple Find.find code block, because it involves figuring out which files have certain relationships to each other. Find.find takes care of walking the file tree, but it would be very inefficient to make a single trip through the tree and immediately spit out sets of duplicates. Instead, we group the files by size, and then group the same-sized files by their MD5 checksums.

The MD5 checksum is a short binary string used as a stand-in for the contents of a file. It's commonly used to verify that a huge file was downloaded without errors. It's not impossible for two different files to have the same MD5 sum, but unless someone is deliberately trying to trick you, it's almost impossible for two files to have the same size and the same MD5 sum without being identical.
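
A quick irb demonstration of that stand-in property (the strings here are arbitrary):

	require 'digest/md5'

	# Identical contents always produce identical sums...
	Digest::MD5.hexdigest('some file contents') ==
	  Digest::MD5.hexdigest('some file contents')    # => true

	# ...while changing even a single byte changes the sum completely.
	Digest::MD5.hexdigest('Some file contents') ==
	  Digest::MD5.hexdigest('some file contents')    # => false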

Calculating an MD5 sum is expensive: it means performing a mathematical calculation on the entire contents of the file. Grouping the files by size beforehand greatly reduces the number of sums that must be calculated, but that's still a lot of I/O. Even if two same-sized files differ in their very first byte, the code above reads both files in their entirety.
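
A related caveat: File.read slurps each file into memory all at once. If your candidate files are very large, you can feed the digest one block at a time instead. Here's a minimal sketch (the method name md5_of_file and the 8 KB block size are our own choices, not part of the recipe):

	require 'digest/md5'

	# Compute a file's MD5 sum incrementally, one block at a time,
	# instead of reading the whole file into memory with File.read.
	def md5_of_file(path, block_size = 1024 * 8)
	  md5 = Digest::MD5.new
	  open(path, 'rb') do |file|
	    while block = file.read(block_size)
	      md5.update(block)
	    end
	  end
	  md5.hexdigest
	end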

Here's a different version of the same program that takes an incremental approach like the one seen in Recipe 6.10. When it thinks a set of files might contain duplicates, it makes repeated calls to a method called eliminate_non_duplicates. The duplicates are yielded and the nonduplicates discarded over the course of these calls.

	#!/usr/bin/ruby
	# find_duplicates2.rb

	require 'find'
	BLOCK_SIZE = 1024 * 8

	def each_set_of_duplicates(*paths, &block)
	  # Group every file under the given paths by size.
	  sizes = Hash.new { |h, k| h[k] = [] }
	  Find.find(*paths) { |f| sizes[File.size(f)] << f if File.file? f }

	  sizes.each_pair do |size, files|
	    next unless files.size > 1
	    # Compare the same-sized files one block at a time, winnowing
	    # out non-duplicates as soon as a block fails to match.
	    offset = 0
	    files = [files]
	    while !files.empty? && offset <= size
	      files = eliminate_non_duplicates(files, size, offset, &block)
	      offset += BLOCK_SIZE
	    end
	  end
	end

The method eliminate_non_duplicates takes lists of files that might contain duplicates. It reads each file one eight-kilobyte block at a time, comparing just one block of each file per call. Files whose blocks don't match the corresponding blocks of any other file are discarded: they're not duplicates. All files with the same block are put into a new list of possible duplicates, which is sent back to each_set_of_duplicates.

If two files are not duplicates, eliminate_non_duplicates will eventually find a block where they differ. Otherwise, it will eventually read the last block of each file and confirm them as duplicates.

	def eliminate_non_duplicates(partition, size, offset)
	  possible_duplicates = []
	  partition.each do |possible_duplicate_set|
	    blocks = Hash.new { |h, k| h[k] = [] }
	    possible_duplicate_set.each do |f|
	      # Read one block of this file, starting at the current offset.
	      block = open(f, 'rb') do |file|
	        file.seek(offset)
	        file.read(BLOCK_SIZE)
	      end
	      # read returns nil at end-of-file; treat that as an empty block.
	      blocks[block || ''] << f
	    end
	    blocks.each_value do |files|
	      if files.size > 1
	        if offset + BLOCK_SIZE >= size
	          # We've compared every block: these are duplicates.
	          yield files
	        else
	          # We suspect these are duplicates, but we need to compare
	          # more blocks of data.
	          possible_duplicates << files
	        end
	      end
	    end
	  end
	  return possible_duplicates
	end

	each_set_of_duplicates(*ARGV) do |f|
	  puts "Duplicates: #{f.join(", ")}"
	end

This code is more complicated, but in real-world situations it's considerably faster. Most files of the same size are not duplicates, and it's cheaper to discover this by reading eight kilobytes than by reading many megabytes and then comparing two MD5 sums. This version also eliminates any last possibility that each_set_of_duplicates will claim two files are duplicates when they're not.
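
As a quick sanity check of our own (it assumes the two methods above are already loaded, for example in irb; the filenames and contents are arbitrary), you can confirm that only truly identical files are reported:

	require 'tmpdir'

	# Two identical files, plus a third of the same size but different
	# contents: only the identical pair should be yielded.
	Dir.mktmpdir do |dir|
	  File.open(File.join(dir, 'a.txt'), 'w') { |f| f << 'same contents' }
	  File.open(File.join(dir, 'b.txt'), 'w') { |f| f << 'same contents' }
	  File.open(File.join(dir, 'c.txt'), 'w') { |f| f << 'diff contents' }
	  each_set_of_duplicates(dir) do |f|
	    puts "Duplicates: #{f.join(", ")}"
	  end
	end
	# Should print something like: Duplicates: .../a.txt, .../b.txt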

See Also

  • Recipe 6.10, "Comparing Two Files"
  • Recipe 6.12, "Walking a Directory Tree"

