
Mon, Apr. 30th, 2007, 01:53 am
The diff problem has been solved

Those of you who have used version control systems for a long time, or anything which has to diff extensively for that matter, know that once in a while they completely bungle a diff and give output which is, to put it charitably, extremely confusing.

Coming up with an algorithm which reliably gives good diffs is far from trivial, but is a problem which has now thankfully been solved. If you get the latest version of bazaar and use its diff utility it will give reasonable diff output on almost anything. All other version control systems (and I mean all others) should switch to using bazaar's diff algorithm as a swap-in replacement, a change which has essentially no downside. (It also has better asymptotic runtime, and code which is easier to understand and debug.)

Full disclosure: I came up with this diff algorithm, and bazaar uses my code :-) There's some discussion on the bazaar mailing list archives about interesting behavioral tweaks, but they don't appear to have been applied in the current release.

Update: My explanation of how patience diff works and why is here

Mon, Apr. 30th, 2007 09:27 am (UTC)
4zumanga

Is there a simple idea which makes your implementation better? Or is it just cleaner? (I could go and look at the code I suppose)

Personally, I've always fancied a diff which was better when the changes are a single word in a paragraph, as most diff systems are line-based (and therefore paragraph-based, since I don't hard word wrap).

Mon, Apr. 30th, 2007 09:46 am (UTC)
bramcohen

Instead of doing a longest common subsequence on everything, it does a longest common subsequence on lines which occur exactly once on both sides, then recurses between lines which got matched on that pass. Not sure if that qualifies as 'simple'.
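In rough Python, the core of that looks something like this (just a sketch of the description above, not bzr's actual patiencediff.py; a real implementation also matches runs of identical lines in regions that have no unique lines, which this skips):

from bisect import bisect_left

def unique_common(a, b):
    # (index_in_a, index_in_b) pairs for lines that occur exactly
    # once in a and exactly once in b, ordered by index_in_a.
    count_a, count_b = {}, {}
    for line in a:
        count_a[line] = count_a.get(line, 0) + 1
    for line in b:
        count_b[line] = count_b.get(line, 0) + 1
    index_b = dict((line, i) for i, line in enumerate(b))
    return [(i, index_b[line]) for i, line in enumerate(a)
            if count_a[line] == 1 and count_b.get(line) == 1]

def longest_increasing_run(pairs):
    # Patience sorting: the longest subsequence of pairs whose b-indices
    # are increasing (their a-indices already are, by construction).
    piles = []                  # b-index currently on top of each pile
    tops = []                   # index into pairs on top of each pile
    prev = [None] * len(pairs)  # back-pointers for reconstruction
    for k, (ai, bi) in enumerate(pairs):
        j = bisect_left(piles, bi)
        if j > 0:
            prev[k] = tops[j - 1]
        if j == len(piles):
            piles.append(bi)
            tops.append(k)
        else:
            piles[j] = bi
            tops[j] = k
    result = []
    k = tops[-1] if tops else None
    while k is not None:
        result.append(pairs[k])
        k = prev[k]
    result.reverse()
    return result

def patience_matches(a, b, alo=0, ahi=None, blo=0, bhi=None):
    # Matching (index_in_a, index_in_b) pairs for a[alo:ahi] vs b[blo:bhi]:
    # anchor on unique common lines, then recurse in the gaps between them.
    if ahi is None:
        ahi = len(a)
    if bhi is None:
        bhi = len(b)
    anchors = longest_increasing_run(unique_common(a[alo:ahi], b[blo:bhi]))
    matches = []
    last_a, last_b = alo, blo
    for ai, bi in anchors:
        ai, bi = ai + alo, bi + blo
        matches.extend(patience_matches(a, b, last_a, ai, last_b, bi))
        matches.append((ai, bi))
        last_a, last_b = ai + 1, bi + 1
    if anchors:
        matches.extend(patience_matches(a, b, last_a, ahi, last_b, bhi))
    return matches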

Diffing for English text files is a different issue. Line-based diff algorithms are meant mostly for use on programming languages, which in turn are designed to work well when subjected to line-based diff.

Mon, Apr. 30th, 2007 09:58 am (UTC)
4zumanga

Sounds fairly simple. I would expect some theorists not to like it, as you could purposely produce examples where it misbehaves, but I wouldn't expect such cases to occur in practice.

Mon, Sep. 24th, 2007 11:49 pm (UTC)
nothings

(very belated, I know)

"it does a longest common subsequence on lines which occur exactly once on both sides"

Yay, I came up with the exact same trick a few years ago, based on the observation that it was all those blank and "}" lines that tended to get screwed up in C code, while reading the diff paper that talked about the need for equivalence sets. And of course the algorithm gets simpler (and asymptotically faster) since you no longer need to manage equivalence sets. I'm glad somebody else thought of it and "published" it.

Mon, Apr. 30th, 2007 02:25 pm (UTC)
morallybass

When I read your post I was very excited to see what you've done with the diff problem. I downloaded a copy of bazaar but was disappointed to note that it was 1) CLU only and 2) the diffing mechanism seems tightly coupled to the source control.

I'm excited to try out your diffing algorithm purely on its own. I think it would be much more useful as a separate application (with a GUI) than as part of a whole new versioning system.

For example, I'm using Perforce for all of my projects (beyond my control), but it's easy to swap in a new diffing tool for Perforce. I had hoped to try out your diffing solution in this way.

Mon, Apr. 30th, 2007 02:31 pm (UTC)
ianbicking: hmm...

It would be really nice if this was a drop-in replacement for difflib, or otherwise done in a similar style.

For English diffs I use difflib reasonably successfully, splitting on individual words instead of lines (and in HTML I largely ignore the markup and pay attention only to words). It sounds like the improvements here might be useful in that domain as well.
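For example, something like this (a minimal sketch; real word splitting is smarter than str.split, and the HTML handling is omitted):

import difflib

old = "the quick brown fox jumps over the lazy dog".split()
new = "the quick red fox leaps over the lazy dog".split()

# Word-level diff: feed difflib lists of words instead of lists of lines.
matcher = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':
        print("%s %r -> %r" % (tag, old[i1:i2], new[j1:j2]))
# e.g. replace ['brown'] -> ['red'] and replace ['jumps'] -> ['leaps']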

Wed, May. 2nd, 2007 12:15 am (UTC)
bramcohen: Re: hmm...

There's a modification to patience diff to make it work on character-based instead of line-based input. Since it's unlikely that there are any individual characters which occur exactly once on both sides, what you do is take all substrings which occur exactly once on both sides, view their first characters as potential matches, and then do patience sorting on the potential match characters. I don't know what the possible asymptotics are for finding all potentially matching characters with this technique.

Mon, Apr. 30th, 2007 05:06 pm (UTC)
bramcohen

Not sure what you mean by 'CLU only', but merge3.py can be run independently and gives unified diff output.

Mon, Apr. 30th, 2007 07:58 pm (UTC)
bsittler

I see that python -m bzrlib.merge3 a b c does indeed give a three-way merged output of some sort, but the format doesn't match any of the versions classic diff3 produces, so far as I can tell. Is there a reason for this? It would be nice to use this as a drop-in replacement, but a better algorithm is not all that usable until it can really drop in without changing anything else.

Mon, Apr. 30th, 2007 08:03 pm (UTC)
bsittler

Oh, and unless I am mistaken, that output is not in unified diff format.

Wed, May. 2nd, 2007 05:07 pm (UTC)
bramcohen

Sorry, the correct file to use is patiencediff.py, which gives output that looks like unified diff to me, and which is claimed to be unified format elsewhere in these comments. (It pulls in something from the bzr library, but that's just for logging and can be removed trivially.)

Wed, May. 2nd, 2007 08:33 pm (UTC)
bsittler

Neat! It works well, and I will play with it more when I have a moment.

Tue, May. 1st, 2007 02:29 am (UTC)
jameinel: difflib replacement

Actually, we took the Patience Diff algorithm and made it into a SequenceMatcher, which is the API that difflib uses.

We actually even broke it out into a separate module, which has its own 'main' in case you want to play with it.

It is just called "patiencediff.py", not "diff.py" or "merge3.py"; those do other things :).

We implemented it as a SequenceMatcher, which is the general API for difflib. The only other thing we had to do was copy and paste 'unified_diff' and give it a parameter for a custom SequenceMatcher.
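Roughly like this (reconstructed from memory as a sketch, not the exact code in patiencediff.py; the hunk headers here don't special-case zero-length ranges the way the stdlib version does):

import difflib

def unified_diff(a, b, fromfile='', tofile='', n=3, lineterm='\n',
                 sequencematcher=difflib.SequenceMatcher):
    # Same shape as difflib.unified_diff, but the matcher class is a
    # parameter, so a patience-based SequenceMatcher can be dropped in.
    started = False
    for group in sequencematcher(None, a, b).get_grouped_opcodes(n):
        if not started:
            yield '--- %s%s' % (fromfile, lineterm)
            yield '+++ %s%s' % (tofile, lineterm)
            started = True
        i1, i2 = group[0][1], group[-1][2]
        j1, j2 = group[0][3], group[-1][4]
        yield '@@ -%d,%d +%d,%d @@%s' % (i1 + 1, i2 - i1, j1 + 1, j2 - j1, lineterm)
        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag in ('replace', 'delete'):
                for line in a[i1:i2]:
                    yield '-' + line
            if tag in ('replace', 'insert'):
                for line in b[j1:j2]:
                    yield '+' + line

Then it can be called with sequencematcher set to the patience-based matcher class, and everything else stays plain difflib.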


We did like Bram's algorithm. It does a few nice things.

1) Early versions of difflib used a recursive algorithm and could cause stack overflow (and really, really horrible O(N^2) or O(N^3) behavior) on certain files. PatienceDiff looks more like O(N log N), though it has larger constants. So for small files difflib wins; for large ones PatienceDiff is a lot better.

2) It generally doesn't get confused by a bare "else:" statement in the middle after you've moved some code around. I can't count how many times I've struggled to parse a diff because an 'else:' in the middle of a complex statement broke a diff chunk into two instead of one.

There is only one downside that I know of, and it doesn't happen very often. Because Patience Diff only uses unique lines as initial match points, copied code can cause larger-than-expected diffs. For example:

start
common
code

common
code
repeated


If you just change 'start' and 'repeated' it will mark the whole thing as new (even though the whole middle section is unchanged).

We looked into using standard difflib in between matching sections to overcome that, but we ended up with even worse performance. (When the whole 100k-line file doesn't match, you pay one attempt for Patience and then another for difflib, so now you have O(N^2 + N log N).)

The other things we have played with (we implemented them, but didn't merge them into core yet):

1) Stripping whitespace from lines and using that as the "unique line" metric (roughly as sketched below).

2) Doing a full match on stripped lines, and then going through and "unmatching" lines which accidentally matched. This has all of the properties of (1), and additionally causes sections which were just indented/unindented to stay lined up.
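A rough sketch of (1), as a variant of the unique-line counting step (illustration only, not the code in bzr): the uniqueness test keys on the stripped line, but the original line indices are what get matched, so a line that was only re-indented still anchors a match.

def unique_common_stripped(a, b):
    # Uniqueness ignores leading/trailing whitespace; the matched
    # positions are still the original line indices.
    def counts(lines):
        c = {}
        for line in lines:
            key = line.strip()
            c[key] = c.get(key, 0) + 1
        return c
    count_a, count_b = counts(a), counts(b)
    index_b = dict((line.strip(), i) for i, line in enumerate(b))
    return [(i, index_b[line.strip()]) for i, line in enumerate(a)
            if count_a[line.strip()] == 1
            and count_b.get(line.strip()) == 1]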


Wed, May. 2nd, 2007 12:05 am (UTC)
bramcohen: Re: difflib replacement

Technically speaking, in the 'common code' example you gave the blank line will match up.

Can bazaar be made to give unified diffs? If so, what's the command?

I like that second suggestion of doing the full match on stripped lines and then unmatching. It should be computationally cheap, behave predictably and reasonably, and it could open the door to using the extra semantic information it provides about diffs.

Wed, May. 2nd, 2007 03:48 am (UTC)
jameinel: Re: difflib replacement

If you are using bzr for general versioning, do something like:

bzr init
bzr add foo
bzr commit -m foo

# modify foo

bzr diff

That final 'bzr diff' will give a unified diff using Patience.

If you just want to use Patience, you can also do

python bzrlib/patiencediff.py file1 file2

And it should diff them for you.

Wed, May. 2nd, 2007 04:21 pm (UTC)
jameinel: Re: difflib replacement

Oh, and in this exact example, you are right. But I have encountered this "in the wild". All I really remember is that it was a test case, so a lot of the code between two points was identical. None of it was modified, but because it wasn't unique to the sub-section it wasn't matched.

I think one of the possible fixes Aaron Bentley mentioned was to expand the uniqueness algorithm to treat multiple lines as the uniqueness primitive.

So if you had:

a
b
a
c
b
c

While none of a, b, or c are unique, the combinations of ab, ac, and bc are.

I'm not sure if that would fall under the "computationally inexpensive" part, though.
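As a very rough sketch of that idea (my own illustration, nothing implemented in bzr): count adjacent-line pairs instead of single lines when deciding what is "unique", and anchor a match at the first line of each unique pair.

def unique_common_pairs(a, b):
    # A pair (lines[i], lines[i+1]) that occurs exactly once in both
    # files anchors a match at index i, even when the individual lines
    # are repeated elsewhere.
    def pair_counts(lines):
        c = {}
        for i in range(len(lines) - 1):
            key = (lines[i], lines[i + 1])
            c[key] = c.get(key, 0) + 1
        return c
    count_a, count_b = pair_counts(a), pair_counts(b)
    index_b = dict(((b[i], b[i + 1]), i) for i in range(len(b) - 1))
    return [(i, index_b[(a[i], a[i + 1])]) for i in range(len(a) - 1)
            if count_a[(a[i], a[i + 1])] == 1
            and count_b.get((a[i], a[i + 1])) == 1]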

Oh, and the downside of matching stripped lines and removing the matches later is that you have another O(N) pass. It is relatively cheap, but it does mean iterating over the lines another time.

One possible way to make it cheaper is to keep both copies of all the lines around, and have a flag for "exact match" versus "whitespace-only match". It wouldn't really work in the current implementation, but it could if someone were re-implementing it (for example in C, where structs are cheaper than objects are in Python).

Wed, May. 2nd, 2007 05:14 pm (UTC)
bramcohen: Re: difflib replacement

I wouldn't really call a test case 'in the wild'. In any plausible in-the-wild case I've thought of, doing a conflict on the whole section would be much more understandable than breaking it up. (Blocks of all whitespace are of course handled by the match-from-the-beginning special case.)

The uniqueness-of-blocks technique is a good one, and would be necessary to apply this approach to character-based diffing, but it runs into some subtleties when you have, for example, abc vs. abxbc. It's also far from clear what the best algorithmic technique for implementing it is. That said, it should work very well on the difficult problem of character-based diffing.

The single pass for whitespace stripping is so fast that it might be completely ignorable, unless you're so concerned about speed that percentages matter, which for most applications of diff I think they don't.

Mon, Apr. 30th, 2007 09:55 pm (UTC)
taral

Where's the paper? :D

Tue, May. 1st, 2007 08:52 pm (UTC)
hsenag

Do you mean "has been solved" in some theoretical sense of optimality?

Wed, May. 2nd, 2007 12:10 am (UTC)
bramcohen

It's been solved in the handwavy 'we got it to stop giving crap output so often' sense.

Sat, May. 9th, 2009 05:49 am (UTC)
friendlydog: A definitive solution to an important problem should be published

So, ya gonna publish?