Log in

No account? Create an account

Fri, May. 6th, 2005, 07:24 pm
More version control stuff

Linus Torvalds thinks that Git is already better at merging than everything else. In the sense that Linus personally has a magic merge algorithm knows as 'Andrew Morton' that's correct, but for everybody else this claim is completely false. (I asked around about what he could mean, and some people suggested that 'better' might have meant 'faster', which is the only explanation which doesn't involve Linus being delusional or lying.)

Unfortunately Linus's stated plans for Git are that it should only ever have whole tree three way merging (because that's the simple, practical, expedient way) and that it will have implicit renames and moves of lines between files (because that's the 'right' way). The resulting approach isn't workable even on paper. Whether the plans will be changed in advance of a disaster remains to be seen.

Meanwhile, we recently made a big advance in merge algorithms, you can read the latest about the upcoming new Codeville merge algorithm in this post.

Sat, May. 7th, 2005 09:28 am (UTC)

Are any of the other core developers participating in this debate? I wish Linus weren't like this...

I think I understood the description of the new Codeville merge, once I realised that your concept of "line" can treat two byte-for-byte-equal lines as distinct objects with distinct ordering properties. I also don't see why "diff" isn't sufficient to do your resolution. When you're doing resolution you only have to compare to a single ancestor; diff that against the live version, mark all the deleted lines as dead and treat the inserted lines as novel. The only thing unspecified is exactly where to insert the novel lines among the non-living lines; "as low as possible" probably works fine.

Where can I find out more about the weave format? Google is being somewhat unhepful.

Sat, May. 7th, 2005 12:08 pm (UTC)
(Anonymous): diff

Not sure what you mean by "diff isn't sufficient". There are some reasons to prefer something besides standard longest-common-subsequence to play the role of "diff", but it's more-or-less irrelevant to the merge algorithm what algorithm you decide to use here.

In the simple case, where you have only one parent, then yes, the algorithm you describe works. (Except note that "where to insert the novel lines" isn't unspecified; new lines are always larger than any old lines always go to the end.)

The nasty case is when you have two (or more, I suppose) parents, after a merge. Then you have to somehow align with both, while preserving the order invariants. That's what all the complexity in there is for...

-- Nathaniel

Sat, May. 7th, 2005 02:20 pm (UTC)
ciphergoth: Re: diff

Surely if you have two parents you merge them *first* then use the merged revision as the ancestor to diff against?

It sounds like I don't really understand what's going on. I hope I won't try your patience too much if I explain what I think is going on so you can tell me what my misconception is.

There's a set of line identifiers currently in use, and a total ordering on them.

For each line identifier that's in use, there's a string representing the contents of that line. Two identifers can map to the same string.

For any two identifiers in use, if there's no identifier in use that's between them, I can ask for a new identifier inbetween them and specify what string it maps to. Similarly I can ask for an identifier that's greater than or less than any existing identifier.

Given this, a revision is simply a partial map from the identifier set to a status of either "alive" or "dead". Lines not in the partial domain of that function are "unborn". And given that, merging is simply the creation of a new partial map from two existing partial maps using the rules set out above (dead wins against alive and unborn loses to both).

So merging synthesizes a new annotated revision from two existing annotated revisions.

The difficult problem is going from a sequence of strings to a revision as defined above. However, I only ever have to compare against one ancestor when I do this; if I have two or more ancestors, I can first perform the merge algorithm described to generate a single ancestor, and then use that.

This algorithm is really neat. It's a bad fit for how I had originally imagined a VCS should work, but that's not necessarily a point against it.

Sat, May. 7th, 2005 11:38 am (UTC)
(Anonymous): 3-way merge

Perhaps by "other SCMs", he just means "subversion". :-)

Or maybe he misinterpreted my comments on the limitations of monotone's current merge; I was trying to warn him in a high-level way (since he didn't care about VCS design back then) that monotone's merge algorithm had known bad behavior with long-lived parallel branches, and perhaps he guessed wrongly why this was so...

(For those tuning in from home, monotone currently uses a whole-tree 3-way merge, but not against the least common ancestor like git does, because that can cause silent corruption:
And for more problems with 3-way merge, derived from monotone's year+ of real world experience with it, see "3-way merge considered harmful":
This is what Bram means when he says whole tree three-way merging is "not workable even on paper". Though to be fair, it will work well enough for a while, until there are lots of people using git, at which point it will probably start subtly breaking things.)

(There is a theory which says that only crazy people work on VCS. There is another theory which says that the first theory gets the causality backwards.)

Sat, May. 7th, 2005 05:16 pm (UTC)
bramcohen: Re: 3-way merge

Yes, he may have thought that the problem was your cleverness on top of 3-way merge, rather than 3-way merge itself. It certainly is true that starting with a new fresh history is avoiding a lot of the merge problems for now. It's also avoiding the network protocol problems, which is an even bigger fudge, because we know for a fact that Git's current network protocol can't handle the kernel with full history in a reasonable manner.

He also might have spent an inordinate amount of time evaluating Subversion before figuring out what a piece of shit it is. That could easily give one the impression, concious or not, that the entire field is full of incompetents.

Working on SCMs certainly is crazy-making. I'm finding the highly mathematical nature of the new weave approach a lot more reassuring than I expected.

Sat, May. 7th, 2005 09:36 pm (UTC)
(Anonymous): Re: 3-way merge

Network protocol -- Yeah, the non-scalability of rsync is going to be a lot more obvious, sooner, than problems with merging.

Subversion -- I doubt it. People have been telling him to give up that BK junk for nice sweet Subversion for years, and he's been telling them to piss off for just as long.

Math is reassuring -- Yeah. I tend to say, if you work on SCM, you will go crazy. Nothing you can do about it; the best you can hope for is to choose which brand crazy you want. I went for "paranoia".

-- Nathaniel
Re: 3-way merge - (Anonymous) - Expand
Re: 3-way merge - (Anonymous) - Expand
Re: 3-way merge - (Anonymous) - Expand

Sat, May. 7th, 2005 11:05 pm (UTC)
invrnrv: VCS...

your codeville VCS and many others operate in the level of files, and within files, in the level of lines, is this correct? I'm not very knowledgable in VCS's but it seems that these systems do absolutely no parsing of the code itself. Then I don't understand why you are even attempting to find good ways to merge two versions of code that may have significant structural differences.

It would be better if there were a VCS server that my IDE can connect to while i'm programming, and if the VCS were able to detect possible conflicts. If I change the code in X manner, and you change it like Y, then if there are no conflicts when compiling X given Y and no conflicts when compiling Y given X, then the changes made in X and Y *may* be merged. Certainly if there is a conflict, then the VCS must alert you and me as to why there would be a conflict. (eg I call a function that you have just deprecated). The VCS can also allow coders to enter semantic change information, eg I can specify that I am changing the return characteristics of this function, so that VCS can alert coders who are calling this function. Even if X given Y and Y given X successfully compile, the compiler wouldn't understand semantic changes in code so it's up to the programmers to alert VCS about these.

Semantic and structural real time conflict resolving allows real time project changes and real time coder alerting of possible conflicts, so there is no reason to worry about trying to merge two files and figuring out which line should stay and such.

If a programmer so happens to have to code offline, the client side VCS should still record semantic and structural changes, so that when the user logs on later the VCS server is able to implement those changes, or alert the programmer that "certain changes are not possible unless you resolve these structural conflicts, and take note of these semantic changes that have been made while you were offline".

... or something like that.

- invrnrv

Sat, May. 7th, 2005 11:13 pm (UTC)
bramcohen: Re: VCS...

You're talking about using semantic information about the files being versioned, which while a reasonable idea in principle, introduces a huge amount of complexity in practice. Instead we just view everything as lines, and err on the side of extra conflicts when it comes time to decide what's broken. Yes, semantic mismatches can still happen on clean merge, but those are surprisingly rare, and fairly easy to detect and fix (usually just by doing a build).

There's a fair amount of academic work on semantic merging, none of us working on real systems really trust it at this point. It could potentially be useful though.

Sun, May. 8th, 2005 04:02 am (UTC)
invrnrv: Re: VCS...

what about the real-time aspect that i mentioned? it wouldn't be hard to figure out whether X given Y or Y given X creates more compile errors than the sum of X and Y alone, in which case there is a conflict. This should be relatively straightforward to implement. By alerting the coders while they're programming it resolves possible conflicts so there will be no trouble merging.

Also, it would be ideal to hook up the IDE to your VCS because that gives you a lot of information about what line went where.

i'm just rambling my 2 cents. take it for what its worth =]

Sun, May. 8th, 2005 10:27 am (UTC)
ciphergoth: Re: VCS...

One attractive thing about three-way-merge is that it means that the network protocol need know nothing of the file semantics used to implement merging - which in turn should mean that you can have arbitrarily sophisticated merges, using all sorts of clever semantic information, without making the whole VCS too complicated to live.

This is why I'm unhappy to learn that 3WM has such serious problems - a Codeville-like approach means that servers are exchanging semantic information about the history of every line in a file, which means that the network protocol is assuming "all the world's a line-oriented text format" in a very deep way. With something like Monotone, a clever client can merge C files one way, Python another, XML another and PDF another still without forcing the network protocol and storage formats to know about all these different file formats.

There seems to be talk of using Codeville's merge in Monotone. I'm curious to know how this would work - for each file, you'd have to crawl over many revisions, infer the anottated version anew, and from there do the merge. Could that be done efficiently enough?

Sat, May. 7th, 2005 11:38 pm (UTC)

(bram, you're surrounded in people who are using code as a linguistic expression of the poetic..
you gave me the idea to integrate 'your' protocols/projects into a multimedia project that i'm a part of - of which neil gaiman is creative consultant and daddy-figure - which characterizes vital truths and plays up, in part, the mythical nature of high technology and its parallels to archetypes of all time..

and now, back to your debate...)

Mon, May. 9th, 2005 01:12 pm (UTC)

This algorithm seems to merge versions rather than merging changes. Supposing I have A -> B -> C and I want to apply the bugfix in B -> C to version A without including the change A -> B - how can I do that using this merge algorithm? Is that what Linus means by "cherrypicking"?

If all he means is choosing the parts of B -> C he wants to merge, then that's trivial - create a synthetic version B' and consider the change to be B -> B' -> C, then merge B -> B'.

Mon, May. 9th, 2005 02:44 pm (UTC)

Yes, that's what's meant by cherry-picking.

There are two approaches to cherry-picking here - either pull in everything and add supressions to lines which shouldn't appear, or only pull in certain lines. I'm not sure which is the best approach on a technical level. I'm also not sure what the best approach to cherry picking is. See this post for my latest thoughts on the matter. Certainly supporting cherry-picking based on this algorithm will be entirely doable, but I'm still going over basic functionality and haven't worked it all out yet.

Mon, May. 9th, 2005 04:51 pm (UTC)

I think you can do cherrypicking as follows - (u)nborn, (l)ive, (d)ead. Including backward transactions for rollback...

uuu u
uul l
uud d
ulu u
ull u
uld d
udu u
udl u (?)
udd u
luu l
lul l
lud d
llu u
lll l
lld d
ldu u (?)
ldl l
ldd l
duu d
dul d (?)
dud d
dlu d (?)
dll d
dld d
ddu u
ddl l
ddd d

Mon, May. 9th, 2005 07:34 pm (UTC)

Your chart is a little inconsistent. There are two things we can go for here -

Pull in everything in C which isn't already in B

Remove everything in B which isn't in C

If we're doing the pull in approach, then anything which is d in A should remain d. If we're doing the removal approach, then ddu should result in u.

I think the first on is 'cherry-pick' and the second one is 'rollback'. The semantics of rollback are such that we'd really like rollbacks to stay rolled back on later merges, so this simplistic approach fails for what we'd like that functionality to do.

If we go with straight cherry-picking, we get the following table, for which I also included what side won for the purposes of detecting conflicts:

uuu u
uul l >
uud d >
ulu u
ull u <
uld d >
udu u
udl u <
udd u <
luu l <
lul l
lud d >
llu l <
lll l
lld d >
ldu l <
ldl l
ldd l <
duu d <
dul d <
dud d
dlu d <
dll d <
dld d
ddu d <
ddl d <
ddd d

The line which gives me pause is uld. The question is - if a section is overwitted in B, then re-overwritten in C, should the result of cherry-picking C-B then merging with B be to drop B's change or conflict? My current feeling is that it should just drop B's change, which means that d is the right answer, but I'm not 100% certain that u isn't the correct behavior.

Thanks for pointing out this approach, it hadn't occured to me. My brain still hasn't made the full paradigm shift to the weave way of thinking.