Mon, Apr. 30th, 2007, 01:53 am
The diff problem has been solved
Those of you who have used version control systems for a long time, or anything which has to diff extensively for that matter, know that once in a while they completely bungle a diff and give output which is, to put it charitably, extremely confusing.
Coming up with an algorithm which reliably gives good diffs is far from trivial, but it is a problem which has now thankfully been solved. If you get the latest version of bazaar and use its diff utility, it will give reasonable diff output on almost anything. All other version control systems (and I mean all others) should switch to using bazaar's diff algorithm as a swap-in replacement, a change which has essentially no downside. (It also has better asymptotic runtime, and code which is easier to understand and debug.)
Full disclosure: I came up with this diff algorithm, and bazaar uses my code :-) There's some discussion on the bazaar mailing list archives about interesting behavioral tweaks, but they don't appear to have been applied in the current release.
Update: My explanation of how patience diff works and why is here
Mon, Apr. 30th, 2007 09:46 am (UTC)
Instead of doing a longest common subsequence on everything, it does a longest common subsequence on lines which occur exactly once on both sides, then recurses between lines which got matched on that pass. Not sure if that qualifies as 'simple'.
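That two-phase description can be sketched directly in Python. This is an illustration written from the description above, not bzr's actual patiencediff.py; all function names are made up:

```python
import bisect
from collections import Counter

def unique_matches(a, b):
    """(i, j) index pairs for lines that occur exactly once in both a and b."""
    count_a, count_b = Counter(a), Counter(b)
    index_b = {line: j for j, line in enumerate(b)}
    return [(i, index_b[line]) for i, line in enumerate(a)
            if count_a[line] == 1 and count_b[line] == 1]

def patience_lis(pairs):
    """Longest chain of pairs increasing in both coordinates, via patience
    sorting with backpointers.  `pairs` is already sorted by first coordinate."""
    tails = []    # tails[k]: smallest j ending a chain of length k + 1
    heads = []    # heads[k]: index into `seen` of that chain's last pair
    seen = []     # (pair, index of predecessor in `seen`, or -1)
    for pair in pairs:
        k = bisect.bisect_left(tails, pair[1])
        seen.append((pair, heads[k - 1] if k else -1))
        if k == len(tails):
            tails.append(pair[1])
            heads.append(len(seen) - 1)
        else:
            tails[k] = pair[1]
            heads[k] = len(seen) - 1
    chain, cur = [], heads[-1] if heads else -1
    while cur != -1:
        pair, cur = seen[cur]
        chain.append(pair)
    return chain[::-1]

def patience_matches(a, b, a_off=0, b_off=0):
    """Anchor on unique lines first, then recurse into the gaps between
    anchors, where previously non-unique lines may have become unique."""
    anchors = patience_lis(unique_matches(a, b))
    matches, last_i, last_j = [], 0, 0
    for i, j in anchors:
        matches += patience_matches(a[last_i:i], b[last_j:j],
                                    a_off + last_i, b_off + last_j)
        matches.append((a_off + i, b_off + j))
        last_i, last_j = i + 1, j + 1
    if anchors:
        matches += patience_matches(a[last_i:], b[last_j:],
                                    a_off + last_i, b_off + last_j)
    return matches
```

Note that the real implementation also matches common leading and trailing lines before recursing, which catches runs that the unique-line pass alone would miss.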
Diffing for english text files is a different issue. Line-based diff algorithms are meant mostly for use on programming languages, which in turn are designed to work well when subjected to line-based diff.
Mon, Sep. 24th, 2007 11:49 pm (UTC)
(very belated, I know) "it does a longest common subsequence on lines which occur exactly once on both sides"
Yay, I came up with the exact same trick a few years ago, based on the observation that it was all those blank and "}" lines that tended to get screwed up in C code, while reading the diff paper that talked about the need for equivalence sets. And of course the algorithm gets simpler (and asymptotically faster) since you no longer need to manage equivalence sets. I'm glad somebody else thought of it and "published" it.
Mon, Apr. 30th, 2007 02:25 pm (UTC)
When I read your post I was very excited to see what you've done with the diff problem. I downloaded a copy of bazaar but was disappointed to note that 1) it's CLU only and 2) the diffing mechanism seems tightly coupled to the source control.
I'm excited to try out your diffing algorithm purely on its own. I think it would be much more useful as a separate application (with a GUI) than as part of a whole new versioning system.
For example, I'm using Perforce for all of my projects (beyond my control), but it's easy to swap in a new diffing tool for Perforce. I had hoped to try out your diffing solution in this way.
Mon, Apr. 30th, 2007 02:31 pm (UTC)
It would be really nice if this was a drop-in replacement for difflib, or otherwise done in a similar style.
For English diffs I use difflib reasonably successfully, splitting on individual words instead of lines (and in HTML I largely ignore the markup and pay attention only to words). It sounds like the improvements here might be useful in that domain as well.
Wed, May. 2nd, 2007 12:15 am (UTC)
bramcohen: Re: hmm...
There's a modification to patience diff to make it work on character-based instead of line-based input. Since it's unlikely that there are any individual characters which occur exactly once on both sides, what you do is take all substrings which occur exactly once on both sides, and view their first characters as potential matches, then do patience sorting on the potential match characters. I don't know what the possible asymptotics for finding all potentially matching characters with this technique are.
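The "occurs exactly once on both sides" part can be made concrete with a brute-force sketch. This is quadratic-or-worse in string length, so it only illustrates the definition, not a usable algorithm, and the helper name is made up:

```python
from collections import Counter

def unique_common_substrings(a, b, min_len=2):
    """All substrings of length >= min_len that occur exactly once in a
    AND exactly once in b.  Brute force, for illustration only."""
    def once(s):
        counts = Counter(s[i:i + n]
                         for n in range(min_len, len(s) + 1)
                         for i in range(len(s) - n + 1))
        return {sub for sub, c in counts.items() if c == 1}
    return once(a) & once(b)
```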
Mon, Apr. 30th, 2007 05:06 pm (UTC)
Not sure what you mean by 'CLU only' but merge3.py can be run independently and gives unified diff output.
Mon, Apr. 30th, 2007 07:58 pm (UTC)
I see that
python -m bzrlib.merge3 a b c
does indeed give a three-way merged output of some sort, but the format doesn't match any of the versions classic diff3 produces, so far as I can tell. Is there a reason for this? It would be nice to use this as a drop-in replacement, but a better algorithm is not all that usable until it can really drop in without changing anything else.
Mon, Apr. 30th, 2007 08:03 pm (UTC)
Oh, and unless I am mistaken, that output is not in unified diff format.
Wed, May. 2nd, 2007 05:07 pm (UTC)
Sorry, the correct file to use is patiencediff.py, which gives output which looks like unified diff to me, and which is claimed to be unified format elsewhere in the comments here. (It pulls in something from the bzr library, but that's just for logging and can be removed trivially).
Wed, May. 2nd, 2007 08:33 pm (UTC)
Neat! Works well, and I will play with it more when I have a moment.
Tue, May. 1st, 2007 02:29 am (UTC)
jameinel: difflib replacement
Actually, we took the Patience Diff algorithm and made it into a SequenceMatcher, which is the API that difflib uses.
We actually even broke it out into a separate module, which has its own 'main' in case you want to play with it.
It is just called "patiencediff.py" not "diff.py" or "merge3.py", those do other things :).
We implemented it as a SequenceMatcher, which is the general API for difflib. The only other thing we had to do was copy and paste 'unified_diff' and give it a parameter for a custom SequenceMatcher.
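The copy-and-tweak amounts to something like the following sketch (a hypothetical function written for this comment, not bzr's actual code): the standard difflib.unified_diff logic, with the matcher class exposed as a parameter so a patience-based SequenceMatcher could be passed in instead.

```python
import difflib

def unified_diff_with(a, b, matcher_class=difflib.SequenceMatcher,
                      fromfile='a', tofile='b', n=3):
    """difflib.unified_diff with the SequenceMatcher class as a parameter.
    Any class with difflib's SequenceMatcher interface can be swapped in."""
    yield '--- %s\n' % fromfile
    yield '+++ %s\n' % tofile
    for group in matcher_class(None, a, b).get_grouped_opcodes(n):
        i1, i2 = group[0][1], group[-1][2]
        j1, j2 = group[0][3], group[-1][4]
        # (real unified diff output drops the ",1" on single-line ranges)
        yield '@@ -%d,%d +%d,%d @@\n' % (i1 + 1, i2 - i1, j1 + 1, j2 - j1)
        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag in ('replace', 'delete'):
                for line in a[i1:i2]:
                    yield '-' + line
            if tag in ('replace', 'insert'):
                for line in b[j1:j2]:
                    yield '+' + line
```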
We did like Bram's algorithm. It does a few nice things.
1) Early versions of difflib used a recursive algorithm and could cause stack overflows (and really, really horrible O(N^2) or O(N^3) behavior) on certain files. PatienceDiff looks more like O(N log N), though it has larger constants. So for small files difflib wins; for large ones PatienceDiff is a lot better.
2) It generally doesn't get confused by a bare "else:" statement in the middle after you've moved some code around. I can't count how many times I've struggled to parse a diff because an 'else:' in the middle of a complex statement broke a diff chunk into two instead of one.
There is only one downside that I know of, and it doesn't happen very often. Because Patience Diff only uses unique lines as initial match points, copied code can cause larger-than-expected diffs. For example:
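Something along these lines, where the body between the 'start' and 'repeated' markers is duplicated boilerplate (a reconstructed illustration; the original example and its exact line names are assumptions):

```
start
setup()
check()
teardown()
repeated
setup()
check()
teardown()
end
```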
If you just change 'start' and 'repeated', it will mark the whole thing as new (even though the whole middle section is unchanged).
We looked into using standard difflib in between matching sections to overcome that, but we ended up with even worse performance. (When the whole 100k-line file doesn't match, you pay one attempt for Patience and then another for difflib, so now you have O(N^2 + N log N).)
The other things we have played with doing (we implemented them, but didn't merge them into core yet):
1) Stripping whitespace from lines and using that as the "unique line" metric.
2) Doing a full match on stripped lines, and then going through and "unmatching" lines which accidentally matched. This has all of the properties of (1), and additionally causes sections which were just indented/unindented to stay lined up.
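Idea (2) can be sketched with stock difflib standing in for the matcher. This is a hypothetical helper written for this comment, not the bzr code: match on whitespace-stripped lines, then classify each matched pair by comparing the original text.

```python
import difflib

def match_ignoring_indent(a, b):
    """Match whitespace-stripped lines, then tag each matched pair as an
    exact match or a whitespace-only (indentation-change) match."""
    sa = [line.strip() for line in a]
    sb = [line.strip() for line in b]
    sm = difflib.SequenceMatcher(None, sa, sb, autojunk=False)
    exact, ws_only = [], []
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            # "unmatch" to whitespace-only if the originals differ
            (exact if a[i] == b[j] else ws_only).append((i, j))
    return exact, ws_only
```

With this split, a section that was merely re-indented stays lined up, and the whitespace-only pairs carry the extra semantic information.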
Wed, May. 2nd, 2007 12:05 am (UTC)
bramcohen: Re: difflib replacement
Technically speaking, in the 'common code' example you gave the blank line will match up.
Can bazaar be made to give unified diffs? If so, what's the command?
I like that second suggestion of doing the full match on stripped lines and then unmatching. It should be computationally cheap, behave predictably and reasonably, and possibly lead to using the extra semantic information it provides about diffs.
Wed, May. 2nd, 2007 03:48 am (UTC)
jameinel: Re: difflib replacement
If you are using bzr for general versioning, do something like:
bzr add foo
bzr commit -m foo
# modify foo
bzr diff
This will give a unified diff using Patience.
If you just want to use Patience, you can also do
python bzrlib/patiencediff.py file1 file2
And it should diff them for you.
Wed, May. 2nd, 2007 04:21 pm (UTC)
jameinel: Re: difflib replacement
Oh, and in this exact example, you are right. But I have encountered this "in the wild". All I really remember was that it was a test case, and so a lot of the code between two points was identical. None of it was modified, but because it wasn't unique to the sub-section it wasn't matched.
I think one of the possible fixes Aaron Bentley mentioned was to expand the uniqueness algorithm to treat multiple lines as the uniqueness primitive.
So if you had:
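For instance, a sequence like this (a reconstructed illustration, since the original example is not shown), where each line appears twice but the two-line combinations each start at only one place:

```
a
b
a
c
b
c
```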
While none of a, b, or c are unique, the combinations of ab, ac, and bc are.
I'm not sure if that would fall under the "computationally inexpensive" part, though.
Oh, and the downside of matching stripped lines and removing the matches later is that you add another O(N) pass. It is relatively cheap, but it does mean iterating over the lines one more time.
One possible way to make it cheaper is to keep both copies of all the lines around, and have a flag for "exact match" versus "whitespace-only match". It wouldn't really work in the current implementation, but it could if someone were re-implementing it (in C, say, where structs are cheaper than objects are in Python).
Wed, May. 2nd, 2007 05:14 pm (UTC)
bramcohen: Re: difflib replacement
I wouldn't really call a test case 'in the wild'. In any plausible in-the-wild case I've thought of, a conflict on the whole section would be much more understandable than breaking it up. (Blocks of all whitespace are of course handled by the match-from-the-beginning special case.)
The uniqueness of blocks technique is a good one, and would be necessary to apply this approach to character-based diffing, but runs into some subtleties when you have, for example, abc vs. abxbc. It's also far from clear what the best technique algorithmically is for implementing it. That said, it should work very well on the difficult problem of character-based diffing.
The single pass for whitespace stripping is so fast that it might be completely ignorable, unless you're so concerned about speed that percentages matter, which for most applications of diff I think they don't.
Mon, Apr. 30th, 2007 09:55 pm (UTC)
Where's the paper? :D
Tue, May. 1st, 2007 08:52 pm (UTC)
Do you mean "has been solved" in some theoretical sense of optimality?
Wed, May. 2nd, 2007 12:10 am (UTC)
It's been solved in the handwavy 'we got it to stop giving crap output so often' sense.
Sat, May. 9th, 2009 05:49 am (UTC)
friendlydog: A definitive solution to an important problem should be published
So, ya gonna publish?