You are viewing bramcohen

Tue, Mar. 30th, 2010, 11:10 am
Patience Diff Advantages

There's no coherent explanation of what the advantages of Patience Diff are, so I'll explain now. First, a quick overview of how Patience Diff works -
  1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
  2. Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.
  3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
  4. Do steps 1-2 on each section between matched lines
I've previously described it with the ordering a bit different and a recursive step at the end, but this ordering always gives the same result and performs much better, and the performance hit of doing the recursion isn't worth it, because it rarely if ever finds any more matches, and even when it does it isn't clear whether the extra matches produce a functionally superior diff. A much more detailed and more graphical explanation is here. A side by side of how that particular diff gets mangled by standard diff algorithms is here. This is the example which motivated patience diff in the first place. If you make extensive changes to a file and remove a function from the end of it and add a function to the beginning, there will be a tendency for an LCS based diff algorithm to match up all of the curly brackets instead of the functions, so if it's matching F1 F2 F3 to F4 F1 F2 instead of viewing F4 as new and F3 as deleted, it will match the curly brackets of F1 to F4, F2 to F1, and F3 to F2. The result is every bit as gross and useless as it sounds, and can easily force a merge conflict in cases which could have been resolved completely cleanly. This has a particularly strong tendency to happen if you put curly brackets on lines by themselves or are religious about putting curly brackets around single-line blocks even though it isn't technically necessary or even if you just put lots of blank lines between your functions.

Another advantage of patience diff is that it frequently doesn't match lines which just plain shouldn't match. For example, if you've completely rewritten a section of code it shouldn't match up the blank lines in each version, as this example shows. Finally, there's this example:


 void func1() {
     x += 1
 }

+void functhreehalves() {
+    x += 1.5
+}
+
 void func2() {
     x += 2
 }
Which is straightforward and obvious, but frequently diff algorithms will interpret it like this:


 void func1() {
     x += 1
+}
+
+void functhreehalves() {
+    x += 1.5
 }
 
 void func2() {
     x += 2
 }
While technically that works, it's obviously inferior and is a bit of a pet peeve of mine. In principle one could tweak an LCS-based diff to always do the right thing here, but the algorithmic complexity of LCS make implementations of it essentially unmaintainable. That's another another advantage of patience diff - it's simple and understandable that it can be modified and extended reasonably, and why it performs correctly in all these examples can be easily analyzed.

Tue, Mar. 30th, 2010 09:55 pm (UTC)
elsmi

Your overview here says "longest common substring"; the post by says "longest common subsequence". If this is intended as the canonical explanation you should perhaps clarify which is meant just to be sure :-).

Tue, Mar. 30th, 2010 10:36 pm (UTC)
bramcohen

Fixed that, thanks. It's supposed to be subsequence. It's kinda confusing because both words are used for diff, and they both have the same acronym.

Wed, Mar. 31st, 2010 04:16 am (UTC)
elsmi

I know.

Tue, Mar. 30th, 2010 11:33 pm (UTC)
agthorr

I'm a little confused. You appear to be comparing Patience Diff with LCS-based diff algorithms, yet Patience Diff appears to make heavy use of LCS. I'm sure I'm missing something; what is it that I'm missing?

I'm not particularly familiar with the internals of diff algorithms (although I did implement a slick amortized worst-case O(n) longest common substring algorithm once).

Tue, Mar. 30th, 2010 11:56 pm (UTC)
bramcohen

Straight LCS-based diff just does an LCS and it's done. Patience diff pulls out just the lines which occur exactly once on both sides, does an LCS on those, and extends out from there. This has the advantage of ignoring potentially misleading lines, and also is a much simpler algorithm due to the lack of duplicate lines.

Thu, Apr. 8th, 2010 02:47 am (UTC)
jsyjr: Stand alone utility?

Is there a stand alone utility available? Something not embedded within a DVCS?

Thu, Apr. 8th, 2010 02:51 am (UTC)
bramcohen: Re: Stand alone utility?

I believe the diff code in bazaar, which uses patience diff (the extended one with recursive application, which I'm basically saying is unnecessary here) can be run as a standalone diff utility.

Thu, Apr. 8th, 2010 10:46 am (UTC)
jsyjr: Re: Stand alone utility?

If you are suggesting that I can say

$ bzr diff path1 path2

that does not work. I get

bzr: ERROR: Not a branch: "path1".

Wed, Jun. 22nd, 2011 01:52 pm (UTC)
axxyaan

I think you would do better to eliminate step 1 and step 2. Because they can make patience diff interpret the diff as the latter in your example here.

Suppose a new function is added, your functhreehalves here. Now what happens when you start comparing lines from the bottom up in your old and new file. You will start by finding the common function func2 in both files. But then you will find that the blank line and the end of func1 in the old file will match the blank line and the end of functhreehalves in the new file. So the result will be a diff in your pet peeve version.

Tue, Aug. 16th, 2011 05:37 pm (UTC)
ironcream

Thank you! :)
Very nice idea and really cool diff output.

Sun, Oct. 28th, 2012 03:01 am (UTC)
Dave Abrahams

+1 for the standalone utility.

Sun, Jan. 13th, 2013 02:16 am (UTC)
extempore

Git will behave as a standalone utility.

git diff --patience --no-index path1 path2