Problem
You want to see if two files contain the same data. If they differ, you might want to represent the differences between them as a string: a patch from one to the other.
Solution
If two files differ, it's likely that their sizes also differ, so you can often solve the problem quickly by comparing sizes. If both files are regular files with the same size, you'll need to look at their contents.
This code does the cheap checks first:
class File
  def File.same_contents(p1, p2)
    return false if File.exist?(p1) != File.exist?(p2)
    return true if !File.exist?(p1)
    return true if File.expand_path(p1) == File.expand_path(p2)
    return false if File.ftype(p1) != File.ftype(p2) ||
                    File.size(p1) != File.size(p2)
Otherwise, it compares the files' contents, a block at a time:
    open(p1) do |f1|
      open(p2) do |f2|
        blocksize = f1.lstat.blksize
        same = true
        while same && !f1.eof? && !f2.eof?
          same = f1.read(blocksize) == f2.read(blocksize)
        end
        return same
      end
    end
  end
end
To illustrate, I'll create two identical files and compare them. I'll then make them slightly different, and compare them again.
1.upto(2) do |i|
  open("output#{i}", 'w') { |f| f << 'x' * 10000 }
end
File.same_contents('output1', 'output2')            # => true
open("output1", 'a') { |f| f << 'x' }
open("output2", 'a') { |f| f << 'y' }
File.same_contents('output1', 'output2')            # => false
File.same_contents('nosuchfile', 'output1')         # => false
File.same_contents('nosuchfile1', 'nosuchfile2')    # => true
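If you don't need the special handling of missing files, Ruby's standard library offers a similar check in FileUtils.compare_file (also available as FileUtils.identical?). The sketch below wraps it in a helper whose name, same_contents_stdlib?, is made up for this example. Note one difference in behavior: this helper reports false when either file is missing, whereas File.same_contents treats two missing files as the same.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper name. FileUtils.compare_file raises Errno::ENOENT
# when a path does not exist, so we treat that case as "not the same".
def same_contents_stdlib?(p1, p2)
  FileUtils.compare_file(p1, p2)
rescue Errno::ENOENT
  false
end

Dir.mktmpdir do |dir|
  a = File.join(dir, 'a')
  b = File.join(dir, 'b')
  File.write(a, 'x' * 10_000)
  File.write(b, 'x' * 10_000)
  same_contents_stdlib?(a, b)                       # => true
  File.write(b, 'y', mode: 'a')
  same_contents_stdlib?(a, b)                       # => false
  same_contents_stdlib?(a, File.join(dir, 'nope'))  # => false
end
```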
Discussion
The code in the Solution works well if you only need to determine whether two files are identical. If you need to see the differences between two files, the most useful tool is Austin Ziegler's Diff::LCS library, available as the diff-lcs gem. It implements a sophisticated diff algorithm that can find the differences between any two enumerable objects, not just strings. You can use its LCS module to represent the differences between two nested arrays, or other complex data structures.
The downside of such flexibility is a poor interface when you just want to diff two files or strings. A diff is represented by an array of Change objects, and though you can traverse this array in helpful ways, there's no simple way to just turn it into a string representation of the sort you might get by running the Unix command diff.
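To make the idea of a diff as a list of changes more concrete, here's a rough pure-Ruby sketch that compares two arrays and marks each element as kept, removed, or added. It uses a textbook dynamic-programming longest-common-subsequence table. It is not the algorithm Diff::LCS implements, and it does not use that library's API; the method name simple_diff and the [marker, item] pair format are invented for this illustration.

```ruby
# Illustrative only: a textbook LCS table, not Diff::LCS's algorithm or API.
# Returns an array of [marker, item] pairs, where marker is ' ' (unchanged),
# '-' (only in old_items), or '+' (only in new_items).
def simple_diff(old_items, new_items)
  # lengths[i][j] is the LCS length of old_items[i..-1] and new_items[j..-1].
  lengths = Array.new(old_items.size + 1) { Array.new(new_items.size + 1, 0) }
  (old_items.size - 1).downto(0) do |i|
    (new_items.size - 1).downto(0) do |j|
      lengths[i][j] =
        if old_items[i] == new_items[j]
          lengths[i + 1][j + 1] + 1
        else
          [lengths[i + 1][j], lengths[i][j + 1]].max
        end
    end
  end

  # Walk the table from the start, recording one change per step.
  changes = []
  i = j = 0
  while i < old_items.size && j < new_items.size
    if old_items[i] == new_items[j]
      changes << [' ', old_items[i]]
      i += 1
      j += 1
    elsif lengths[i + 1][j] >= lengths[i][j + 1]
      changes << ['-', old_items[i]]
      i += 1
    else
      changes << ['+', new_items[j]]
      j += 1
    end
  end
  # Whatever is left over on either side is a pure removal or addition.
  old_items[i..-1].each { |item| changes << ['-', item] }
  new_items[j..-1].each { |item| changes << ['+', item] }
  changes
end

old_lines = ['This is line one.', 'This is line two.']
new_lines = ['This is line 1.', 'This is line two.', 'This is line 4.']
puts simple_diff(old_lines, new_lines).map { |marker, item| "#{marker}#{item}" }
# -This is line one.
# +This is line 1.
#  This is line two.
# +This is line 4.
```

Because the sketch relies only on ==, it works on arrays of any objects, not just strings. It has none of the hunk or context handling that a real diff format needs.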
Fortunately, the diff-lcs gem comes with the command-line diff programs ldiff and htmldiff, and you can adapt their code if you need to perform a textual diff from within Ruby.
Here's some code, adapted from the ldiff command-line program, which builds a string representation of the differences between two strings. The result is something you might see by running ldiff, or the Unix command diff. The most common diff formats are :unified and :context.
require 'rubygems'
require 'diff/lcs'
require 'diff/lcs/hunk'

def diff_as_string(data_old, data_new, format=:unified, context_lines=3)
First we massage the data into shape for the diff algorithm:
  data_old = data_old.split(/\n/).map! { |e| e.chomp }
  data_new = data_new.split(/\n/).map! { |e| e.chomp }
Then we perform the diff, and transform each "hunk" of it into a string:
  output = ""
  diffs = Diff::LCS.diff(data_old, data_new)
  return output if diffs.empty?
  oldhunk = hunk = nil
  file_length_difference = 0
  diffs.each do |piece|
    begin
      hunk = Diff::LCS::Hunk.new(data_old, data_new, piece, context_lines,
                                 file_length_difference)
      file_length_difference = hunk.file_length_difference

      next unless oldhunk

      # Hunks may overlap, which is why we need to be careful when our
      # diff includes lines of context. Otherwise, we might print
      # redundant lines.
      if (context_lines > 0) and hunk.overlaps?(oldhunk)
        hunk.unshift(oldhunk)
      else
        output << oldhunk.diff(format)
      end
    ensure
      oldhunk = hunk
      output << "\n"
    end
  end

  # Handle the last remaining hunk.
  output << oldhunk.diff(format) << "\n"
end
Here it is in action:
s1 = "This is line one.\nThis is line two.\nThis is line three.\n"
s2 = "This is line 1.\nThis is line two.\nThis is line three.\n" +
     "This is line 4.\n"
puts diff_as_string(s1, s2)
# @@ -1,4 +1,5 @@
# -This is line one.
# +This is line 1.
#  This is line two.
#  This is line three.
# +This is line 4.
With all that code, on a Unix system you could be forgiven for just calling out to the Unix diff program:
open('old_file', 'w') { |f| f << s1 }
open('new_file', 'w') { |f| f << s2 }

puts %x{diff old_file new_file}
# 1c1
# < This is line one.
# ---
# > This is line 1.
# 3a4
# > This is line 4.
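If the filenames come from outside your program, interpolating them into a shell command is risky. Here's one possible variation, assuming a diff executable that supports the -u (unified) option is on your PATH. It passes the filenames as separate arguments through Open3.capture2, so no shell is involved, and it checks diff's exit status (0 for no differences, 1 for differences, anything higher for trouble). The helper name unix_diff is made up for this sketch.

```ruby
require 'open3'

# Hypothetical helper: returns the unified diff of two files as a string,
# or an empty string when they match. Assumes `diff` is installed.
def unix_diff(old_path, new_path)
  output, status = Open3.capture2('diff', '-u', old_path, new_path)
  # Exit status 0 means no differences and 1 means differences were found.
  # Anything else (or no exit status at all) means diff itself failed.
  code = status.exitstatus
  raise "diff failed for #{old_path} and #{new_path}" if code.nil? || code > 1
  output
end
```

Calling unix_diff('old_file', 'new_file') returns the same kind of text you'd get from running diff -u at the command line, including the header lines that name the two files.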