Log in

No account? Create an account

Fri, May. 6th, 2005, 07:24 pm
More version control stuff

Linus Torvalds thinks that Git is already better at merging than everything else. In the sense that Linus personally has a magic merge algorithm knows as 'Andrew Morton' that's correct, but for everybody else this claim is completely false. (I asked around about what he could mean, and some people suggested that 'better' might have meant 'faster', which is the only explanation which doesn't involve Linus being delusional or lying.)

Unfortunately Linus's stated plans for Git are that it should only ever have whole tree three way merging (because that's the simple, practical, expedient way) and that it will have implicit renames and moves of lines between files (because that's the 'right' way). The resulting approach isn't workable even on paper. Whether the plans will be changed in advance of a disaster remains to be seen.

Meanwhile, we recently made a big advance in merge algorithms, you can read the latest about the upcoming new Codeville merge algorithm in this post.

Sat, May. 7th, 2005 12:08 pm (UTC)
(Anonymous): diff

Not sure what you mean by "diff isn't sufficient". There are some reasons to prefer something besides standard longest-common-subsequence to play the role of "diff", but it's more-or-less irrelevant to the merge algorithm what algorithm you decide to use here.

In the simple case, where you have only one parent, then yes, the algorithm you describe works. (Except note that "where to insert the novel lines" isn't unspecified; new lines are always larger than any old lines always go to the end.)

The nasty case is when you have two (or more, I suppose) parents, after a merge. Then you have to somehow align with both, while preserving the order invariants. That's what all the complexity in there is for...

-- Nathaniel

Sat, May. 7th, 2005 02:20 pm (UTC)
ciphergoth: Re: diff

Surely if you have two parents you merge them *first* then use the merged revision as the ancestor to diff against?

It sounds like I don't really understand what's going on. I hope I won't try your patience too much if I explain what I think is going on so you can tell me what my misconception is.

There's a set of line identifiers currently in use, and a total ordering on them.

For each line identifier that's in use, there's a string representing the contents of that line. Two identifers can map to the same string.

For any two identifiers in use, if there's no identifier in use that's between them, I can ask for a new identifier inbetween them and specify what string it maps to. Similarly I can ask for an identifier that's greater than or less than any existing identifier.

Given this, a revision is simply a partial map from the identifier set to a status of either "alive" or "dead". Lines not in the partial domain of that function are "unborn". And given that, merging is simply the creation of a new partial map from two existing partial maps using the rules set out above (dead wins against alive and unborn loses to both).

So merging synthesizes a new annotated revision from two existing annotated revisions.

The difficult problem is going from a sequence of strings to a revision as defined above. However, I only ever have to compare against one ancestor when I do this; if I have two or more ancestors, I can first perform the merge algorithm described to generate a single ancestor, and then use that.

This algorithm is really neat. It's a bad fit for how I had originally imagined a VCS should work, but that's not necessarily a point against it.

Sat, May. 7th, 2005 05:07 pm (UTC)
bramcohen: Re: diff

The problem with trying a single ancestor first is that if there was a conflict different hunks might have won in the result, and the wrong side of a conflict might have a match of a single blank line or some similar nonsense and thus invalidate the entire other side.

It might make sense to match against the 'mash-up' of all the ancestors - that is, the result of simply smushing their weaves together, without checking for conflicts. This seems like an alarmingly odd approach, but we recently figured out a new weave ordering algorithm which makes the mash-up look considerably less funky, and I'm slowly warming up to the idea. Simple two-way diffs certainly have aesthetic appeal.

All that stuff about unique lines is just to improve the behavior and run-time of two-way diffs, by the way. We don't want to get erroneous matches on '}' lines.

Mon, May. 9th, 2005 01:38 pm (UTC)
ciphergoth: Re: diff

Doing it against the "mash-up" makes a lot of sense. Especially if you can tweak the "diff" algorithm to prefer matches using more-closely-related lines over less-closely-related if the length of the LCS is the same. That seems perfectly practical for the naive O(nm)-time dynamic programming "diff" algorithm I got taught at Uni; I don't know if it's just as easy for the more sophisticated algorithms used in practice.