No account? Create an account

Tue, Mar. 30th, 2010, 11:10 am Patience Diff Advantages

There's no coherent explanation of what the advantages of Patience Diff are, so I'll explain now. First, a quick overview of how Patience Diff works -
1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines
I've previously described it with the ordering a bit different and a recursive step at the end, but this ordering always gives the same result and performs much better, and the performance hit of doing the recursion isn't worth it, because it rarely if ever finds any more matches, and even when it does it isn't clear whether the extra matches produce a functionally superior diff. A much more detailed and more graphical explanation is here. A side by side of how that particular diff gets mangled by standard diff algorithms is here. This is the example which motivated patience diff in the first place. If you make extensive changes to a file and remove a function from the end of it and add a function to the beginning, there will be a tendency for an LCS based diff algorithm to match up all of the curly brackets instead of the functions, so if it's matching F1 F2 F3 to F4 F1 F2 instead of viewing F4 as new and F3 as deleted, it will match the curly brackets of F1 to F4, F2 to F1, and F3 to F2. The result is every bit as gross and useless as it sounds, and can easily force a merge conflict in cases which could have been resolved completely cleanly. This has a particularly strong tendency to happen if you put curly brackets on lines by themselves or are religious about putting curly brackets around single-line blocks even though it isn't technically necessary or even if you just put lots of blank lines between your functions.

Another advantage of patience diff is that it frequently doesn't match lines which just plain shouldn't match. For example, if you've completely rewritten a section of code it shouldn't match up the blank lines in each version, as this example shows. Finally, there's this example:
```

void func1() {
x += 1
}

+void functhreehalves() {
+    x += 1.5
+}
+
void func2() {
x += 2
}
```
Which is straightforward and obvious, but frequently diff algorithms will interpret it like this:
```

void func1() {
x += 1
+}
+
+void functhreehalves() {
+    x += 1.5
}

void func2() {
x += 2
}
```
While technically that works, it's obviously inferior and is a bit of a pet peeve of mine. In principle one could tweak an LCS-based diff to always do the right thing here, but the algorithmic complexity of LCS make implementations of it essentially unmaintainable. That's another another advantage of patience diff - it's simple and understandable that it can be modified and extended reasonably, and why it performs correctly in all these examples can be easily analyzed.
(Deleted comment)

Tue, Mar. 30th, 2010 11:56 pm (UTC)bramcohen

Straight LCS-based diff just does an LCS and it's done. Patience diff pulls out just the lines which occur exactly once on both sides, does an LCS on those, and extends out from there. This has the advantage of ignoring potentially misleading lines, and also is a much simpler algorithm due to the lack of duplicate lines.