
Sun, Apr. 17th, 2011, 06:56 pm
Git Can't Be Made Consistent

This post complains about Git lacking eventual consistency. I have a little secret for you: Git can't be made to have eventual consistency. Everybody seems to think the problem is a technical one, of complexity vs. simplicity of implementation. They're wrong. The problem is semantics. Git follows the semantics which you want 99% of the time, at the cost of having some edge cases which it's inherently just plain broken on.

When you make a change in Git (and Mercurial) you're essentially making the following statement:

This is the way things are now. Forget whatever happened in the past, this is what matters.

Which is subtly and importantly different from what a lot of people assume it should be:

Add this patch to the corpus of all changes which have ever been made, and are what defines the most recent version.

The example linked above has a lot of extraneous confusing stuff in it. Here's an example which cuts through all the crap:

  A
 / \
B   B
|
A

In this example, one person changed a file's contents from A to B, then back to A, while someone else changed A to B and left it that way. The question is: What to do when the two heads are merged together? The answer deeply depends on the assumed semantics of what the person meant when they reverted back to A. Either they meant 'oops I shouldn't have committed this code to this branch' or they meant 'this was a bad change, delete it forever'. In practice people mean the former the vast majority of the time, and its later effects are much more intuitive and predictable. In fact it's generally a good idea to make a separate branch with the change to B at the same time as the reversion to A is done, so further development can be done on that branch before being merged back in later. So the preferred answer is that it should clean merge to B, the way 3-way merge does it.
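The revert example can be sketched as a whole-file three-way merge. This is a toy model of the semantics, not git's actual implementation; `merge3` is a hypothetical helper that treats each version as a single value:

```python
# Toy model of a whole-file three-way merge (hypothetical helper; a
# sketch of the semantics, not git's code).
def merge3(base, ours, theirs):
    if ours == theirs:
        return ours        # both sides ended up in the same state
    if ours == base:
        return theirs      # only the other side changed anything
    if theirs == base:
        return ours        # only our side changed anything
    raise ValueError("conflict")

# One head went A -> B -> A (a revert), the other went A -> B and
# stayed there.  The merge base of the two heads is the root A, so the
# reverting side looks unchanged and the merge comes out B.
print(merge3(base="A", ours="A", theirs="B"))  # prints B
```

Note that the revert is invisible to the merge: only the endpoints and the common ancestor are consulted, which is exactly the semantic choice being described.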

Unfortunately, this decision comes at significant cost. The biggest problem is that it inherently gives up on implicit cherry-picking. I came up with some magic merge code which allowed you to cut and paste small sections of code between branches, and the underlying version control system would simply figure out what you were up to and make it all work, but nobody seemed much interested in that functionality, and it unambiguously forced the merge result in this case to be A.

A smaller problem, but one which seems to perturb people more, is that there are some massively busted edge cases. The worst one is this:

  A
 / \
B   B
|   |
A   A

Obviously in this case both sides should clean merge to A, but what if people merge like this?

  A
 / \
B   B
|\ /|
A X A
|/ \|

Because of the cases we just went over, they should clean merge to B. What if they are then merged with each other? Since both sides are the same, there's only one thing they can merge to: B

  A
 / \
B   B
|\ /|
A X A
|/ \|
B   B
 \ /
  B

Hey, where'd the A go? Everybody reverted their changes from B back to A, and then via the dark magic of merging the B came back out of the ether, and no amount of further merging will get rid of it again!
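The disappearing A can be traced step by step with a toy whole-file three-way merge (a hypothetical sketch of the semantics, not git's implementation; the merge base for each cross merge is taken to be the root A, since each side is pulling the other's lagging B head):

```python
# Toy whole-file three-way merge, used to walk the criss-cross history.
def merge3(base, ours, theirs):
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    raise ValueError("conflict")

root = "A"
# Both sides commit B, then revert to A.  Each then pulls the other
# side's lagging B head; the merge base is still the root A, so each
# revert loses to the other side's "change" to B.
left  = merge3(root, "A", "B")   # left revert merged with right's B
right = merge3(root, "A", "B")   # symmetric
# Merging the two results: both sides are B, so the answer must be B.
final = merge3(root, left, right)
print(final)   # prints B -- the A everybody reverted to is gone
```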

The solution to this problem in practice is Don't Do That. Having multiple branches which are constantly pulling in each other's changes at a slight lag is bad development practice anyway, so people treat their version control system nicely and cross their fingers that the semantic tradeoff they made doesn't ever cause problems.

Mon, Apr. 18th, 2011 07:20 am (UTC)
MrJoy: You are missing the point.

Git does not attempt to achieve "eventual consistency" because that requires an authoritative notion of The Truth, and Git was explicitly designed to AVOID that.

This is not a design flaw; this is the result of explicitly avoiding what you want.

Authority, in Git, is a *social* matter, not a technical one. And blind merging is explicitly frowned upon -- you should know what the heck you're merging, and be able to make intelligent decisions about it.

But in the event you want to simply disregard a branch, and make a different one "win" when doing a merge, git DOES provide that mechanism: "git merge -s ours".

In short: Stop treating Git like SVN -- it was never meant to be that, and projecting your desires onto it does not make its unwillingness to meet those desires a "flaw"! If you can't suss out how to establish a social structure that provides the authoritative "single source of truth" you want -- or find it distasteful that you need to address that at the social level -- then clearly Git does not meet your needs and you should find a tool that does.

That said, nitpicking about history-tracking as a lament for why what you want will never be -- it just misses the point. Badly. Git will never be what you want because it was designed to NOT be that, period.

Mon, Apr. 18th, 2011 08:26 am (UTC)
bramcohen: Re: You are missing the point.

If you find it so important to not have blind merging, I suggest you do all merges by using diff and manually selecting which hunk wins in each case. That will free you from the confines of a tool which actually keeps track of things for you.

Mon, Apr. 18th, 2011 08:42 am (UTC)
MrJoy: Re: You are missing the point.

Whenever I conduct a merge, I DO use three-way merge tools, but I have learned to ALWAYS ALWAYS ALWAYS sanity-check the conclusions it reaches, hunk-by-hunk.

Except of course in the rare case I want to close out a branch outright and denote that in the history, and then I do a "git merge -s ours" to denote it as such.

You may also find "git rerere" relevant when dealing with repetitious merges that might otherwise get history confused.

However, ultimately, when I find such situations arise I usually conclude it is time to start reminding people that Git is about "patches as communication", and encourage them to act accordingly. I invariably see a decrease in ill-conceived and ill-executed merges that may or may not be doing what people *expect*, a reduction in spurious commits that only communicate minutiae of an individual's particular workflow (and not meaningful information about what the individual set out to *achieve*), and so forth.

Problems like the one you describe are a symptom of people abdicating their responsibility as engineers to The Machine. Now, I am perfectly happy to let machines do for me what they can do better than I can (garbage collection + references > pointers, for example), but when an engineer accepts changes into the codebase he or she is working on without even being AWARE of what those changes are -- a situation that happens by definition when one leaves merging to an automated tool -- then the engineer is inviting a world of hurt upon him or herself.

Know what's in your codebase, or expect it to be broken often and unexpectedly. Communicate upstream in the form of cogent, terse patches that express the whole of a change, and nothing but that change.

To do otherwise necessarily requires one of two options: A) Coordinate early and often to minimize the risk of deviation (I.E. don't branch), or B) Fall into irreconcilable chaos. SVN users are familiar with both. Git users are familiar with A only by virtue of bad habits learned in SVN-land, and familiar with B only by virtue of the worst form of laziness. (I.E. not the kind Larry Wall spoke of...)

Mon, Apr. 18th, 2011 08:56 am (UTC)
bramcohen: Re: You are missing the point.

You seem to enjoy spending time messing around with your version control system.

Mon, Apr. 18th, 2011 09:05 am (UTC)
MrJoy: Re: You are missing the point.

Nope, I enjoy having code that goes from working state to working state, and not getting blindsided accidentally.

Wed, Apr. 20th, 2011 11:59 am (UTC)
ciw88: Re: You are missing the point.

i'm confused.

i think the original point was that git makes random decisions when merging, which forces the git user to look at the result and make sure git's making those random choices to everyone's liking.

the preferred situation would be that a version control system is deterministic and consistent, i.e. that given the same situation it will always do the same thing. if that thing is not what you want, then you will know beforehand, and can fix it. if it is, you don't have to check, you just know it's going to be right.

the drawback of consistency is that it's hard to achieve, and it's harder to understand what happens if you don't follow it with your own eyes hunk by hunk (which you can still do btw). the benefit is that if you're smart enough to use your tools correctly, being consistent they give you a lot more power and efficiency. because you don't have to keep making sure they do the right thing every step of the way.

now what has that to do with social contracts and multiple points of truth? if git were consistent, how would that change the balance of control and authority in a team using it? doesn't consistency, rather than limiting your choices of what hunk goes where, only make it easier for you to implement those choices?

i like git a lot, btw. (-:

Mon, Apr. 18th, 2011 07:45 am (UTC)

I think Monotone's mark-merge escalates all these decisions to the user, FWIW.

Mon, Apr. 18th, 2011 08:33 am (UTC)

I believe you're correct, and also that mark-merge will get horribly over-conservative in situations where two different branches keep pulling old versions of the other one for an extended period of time. It seems like nothing supports that use case well, and no one has ever really complained about it.

Mon, Apr. 18th, 2011 08:47 am (UTC)
MrJoy: Complaining about pathological insanity...

Why WOULD they complain about it?

What you describe is a pathological scenario wherein the only fathomable explanation would be people making random guesses about what does or does not actually *work*. You are describing monkeys at keyboards.

There simply is not a good reason to go back and forth between historic versions repeatedly with no awareness of why one is doing it, and whether or not one's reason for doing it should override the judgements of others.

Mon, Apr. 18th, 2011 08:58 am (UTC)
bramcohen: Re: Complaining about pathological insanity...

You shouldn't be so sharp in your judgements unless you actually understand what is being discussed, which in this case you clearly don't.

Mon, Apr. 18th, 2011 09:04 am (UTC)
MrJoy: Re: Complaining about pathological insanity...

I've seen the scenarios you describe arise, and in every case it's been a matter of people pushing buttons and hoping for the best. I have never seen a case where people intended to flip-flop between versions and then got hung up because the merge didn't do what they expected and they couldn't clearly and simply say "y'know what, screw B, I want A to 'win'.".

You either know what you expect to happen to the code when you do a merge, or you do not. If you know what you expect, git provides plenty of tools to help you ensure that you get what you expect. If you don't actually know what to expect and are blindly hoping for the best -- then you have no clear sense of your codebase, or the changes being made by others (either the mechanics or the intent).

Mon, Apr. 18th, 2011 02:43 pm (UTC)
bramcohen: Re: Complaining about pathological insanity...

You've both sneered at me for claiming that there's usually a unique latest common ancestor in practice, saying that I'm coming at it from SVN experience, and said that anyone following a methodology which doesn't result in a latest common ancestor is engaged in pathological insanity. You've now covered everything which someone might do as being crazy, and your only claim is that Git somehow has magical pixie dust to make neither of these cases happen, even though there are no other cases possible.

I'm going to start deleting further comments from you unless you start understanding what's being discussed and drop the Git fanboyism. You really aren't contributing anything to the conversation.

Wed, Apr. 20th, 2011 06:34 am (UTC)
mcandre: Re: Complaining about pathological insanity...

Computers exist to perform computations, to do them automatically, scalably, and most importantly, predictably. Regardless of how an edge case manifests, we need computer software to resolve that edge case consistently. Most computer code isn't directly connected to bumbling humans but rather connected via a long daisy chain of code to other code, and finally a user interface. Fuckups like the failure at Dhahran happen when the properties of a system are taken for granted.
(Deleted comment)

Mon, Apr. 18th, 2011 08:39 am (UTC)
bramcohen: Re: Consistency

I believe the terms you would like are 'commutative and associative'. What I mean by 'eventual consistency' is that if everybody eventually pulls in all the same history, they'll all wind up at the same value (assuming no merge conflicts). At least that's what I think I mean, I'm just trying to use the same terminology as earlier posts.

I suspect that every layer is a spoiler for eventual consistency by the way. It just plain won't happen unless you do some very artificial canonical reordering of changes when new history comes in, and that can result in bizarre codebase jumping around in some edge cases.
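The property being described -- everyone who pulls in the same history winds up at the same value -- can be illustrated with a toy whole-file three-way merge (a hypothetical sketch, not git's implementation). Clean merges are symmetric in who pulls whom, but the result still hinges on which ancestor is chosen as the base, which is where history-dependent merging gets order-sensitive:

```python
# Toy whole-file three-way merge (hypothetical helper).
def merge3(base, ours, theirs):
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    raise ValueError("conflict")

# Swapping who pulls whom does not change a clean merge's result...
assert merge3("A", "A", "B") == merge3("A", "B", "A") == "B"

# ...but the same pair of head snapshots merges to opposite values
# depending on which base the history hands you:
assert merge3("A", "A", "B") == "B"   # base A: B is the change, keep it
assert merge3("B", "A", "B") == "A"   # base B: A is the change, keep it
print("convergence depends on base selection")
```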

Mon, Apr. 18th, 2011 06:32 pm (UTC)

Your sixth sentence would read much better as:
Git follows the semantics which you want 99% of the time, at the cost of having some edge cases upon which it's inherently just plain broken.

Mon, Apr. 18th, 2011 07:07 pm (UTC)

Those are grammatical rules up with which I will not put.

Mon, Apr. 18th, 2011 06:38 pm (UTC)

> In this example, one person changed a files contents from A to B, then back to A, while someone else changed A to B and left it that way. The question is: What to do when the two heads are merged together?

Report a merge conflict.

Going back to the basics: what are the semantics of a 3-way merge? Well, we have three snapshots, the base and two branches. We make diffs between the base and the branches. What it actually means is that we reconstruct the programmers' actions from the snapshots: they added these lines, modified these lines, deleted these lines.

Then we either merge the diffs and produce the merged snapshot, or discover a conflict: both programmers modified the same line, or one deleted the line another modified, or both added some lines in the same place.

And if we look at it this way, then the root of the problem is that git (as well as Mercurial, SVN, etc.) takes a shortcut when computing the diffs to be fed into the 3-way merge, and that sometimes produces incorrect/inconsistent results.

An example of git doing it wrong: http://pastebin.com/SxmwpFkY

If I run the script, switch to master and do "git diff master~2", git produces an incorrect diff:
@@ -3,3 +3,11 @@ B
I mean, it's a correct diff between the two snapshots, but this is not what I did.

But when I run "git blame", it produces the correct "diff":
6584854c 1) A
6584854c 2) B
6584854c 3) C
6584854c 4) D
6584854c 5) E
c36e1ff8 6) G
c36e1ff8 7) G
c36e1ff8 8) G
^b43cba4 9) A
^b43cba4 10) B
^b43cba4 11) C
^b43cba4 12) D
^b43cba4 13) E

Here it examines all commits leading to the current one, and deduces the position of the lines from the initial commit (^b43cba4) correctly.

And that's it. Normal diffs are not associative: given successive snapshots a, b, c, diff(diff(a, b), c) != diff(a, diff(b, c)) != diff(a, c). The output of `blame`, on the other hand, is associative (except maybe for deleted lines).

So it seems that if git (and hg, and svn) used the output of a blame-like algorithm for merging, then the order of automatic merges wouldn't matter. And the problem becomes a purely technical one: how to make this blame-like algorithm fast enough.

(By the way, of course the basic step -- that reconstruction of the programmer's actions from the difference between snapshots -- is not infallible. But when you use it on two adjacent snapshots it's good enough, and also pretty transparent. As the distance between the snapshots being diffed (by the number of intermediate commits) increases, the probability of getting it wrong grows.)
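The shortcut described above can be seen with plain `difflib`. This is a sketch mirroring the shape of the pastebin example, with hypothetical content: start with A..E, append G G G, then prepend a fresh copy of A..E. The composed per-commit diffs attribute the trailing A..E block to the original snapshot, while the direct end-to-end diff matches the leading copy instead and reports the original lines as freshly inserted:

```python
import difflib

# Three snapshots (hypothetical content, same shape as the example).
v1 = ["A", "B", "C", "D", "E"]
v2 = v1 + ["G", "G", "G"]                    # commit 2: append G G G
v3 = ["A", "B", "C", "D", "E"] + v2          # commit 3: prepend A..E

# The per-commit diff v2 -> v3 is a pure insertion at the front, so
# composing the history keeps the original A..E at the tail.
step = difflib.SequenceMatcher(None, v2, v3).get_opcodes()
print(step)    # one 'insert' at position 0, everything else 'equal'

# The direct diff v1 -> v3 instead matches the *leading* A..E block
# and claims the tail (including the original lines) is new.
direct = difflib.SequenceMatcher(None, v1, v3).get_opcodes()
print(direct)
```

Both outputs are "correct" diffs between their snapshots; they just disagree about which copy of the duplicated block is the old one, which is exactly the attribution a blame-like walk over the intermediate commits would pin down.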

Mon, Apr. 18th, 2011 07:15 pm (UTC)

Showing a conflict in that example would be clearly broken behavior. It could result in repeatedly showing the exact same merge conflict over and over again, between the exact same values, on later merges.

Mon, Apr. 18th, 2011 07:46 pm (UTC)

> It could result in repeatedly showing the exact same merge conflict over and over again, between the exact same values, on later merges.

Why? Any merge establishes definitive ancestries for each line of code, and when you are merging "A - B - A" with something, you are supposed to tell it that the conflicting lines come from the base snapshot, not from your "reversal". In fact, when you want to revert a commit, instead of re-committing the previous version you should merge with it, I think.

Was that the problem that you were thinking about?

Thu, Apr. 21st, 2011 06:11 pm (UTC)

Merging with an old version to undo changes just plain doesn't make sense - the whole point of history is that it knows that later versions supersede older versions.

Thu, Apr. 21st, 2011 09:34 pm (UTC)

Also, to clarify: such a merge would be quite different from an ordinary commit with two parents. It would have to store the choices the user made -- from which of the parents he picked the lines -- plus maybe some additional modifications he had to make, as a separate commit.

Mon, Apr. 18th, 2011 06:52 pm (UTC)
zooko: "confusing crap" vs. "nice example"

Dear Bram:

One man's "confusing crap" is another man's "nice example". I have always found your revert-based example to be, while technically interesting, not the sort of thing that I imagine running into a lot in practice. (Like you say, if it hurts when you do that, then don't do that!) Also I tend to get confused when you get to the criss-cross scenario.

On the other hand my bugfix-based example that you link to at the top illustrates an issue that is relevant to pretty much every merge. The only reason people don't notice it in practice is that usually the "fuzzy target selection" algorithm gets lucky. That's the one in which you search for a hunk in the target which is near where the original hunk was located or has some of the same neighboring lines of code as the original hunk had.

Anyway, I'm kind of irritated that you alluded to my nice example (or possibly to Russell O'Connor's extension of it) as "confusing" and "crap". If you can think of a simplification or a clarification of the bugfix-based example, I would be interested to see it. Your revert-based example is not that, though--it is a different thing.



Mon, Apr. 18th, 2011 07:05 pm (UTC)
bramcohen: Re: "confusing crap" vs. "nice example"

The problem with dealing directly with the positioning example is that my argument is completely semantic, I'm basically saying 'maybe the user really did mean for it to be a completely fresh version, and just ignore the history'. Which is basically an argument in favor of fuzzy matches in general. Examples where there's a lot more editing in the interim make it much more likely that the user really didn't follow all the line moves and simply wants the fuzzy match.

My point here is about the higher-level thesis - that consistent merges are just plain impossible. An argument can be made for it in the line ordering case as well, but it's a weaker one, hence my use of just this example.

Mon, Apr. 18th, 2011 07:21 pm (UTC)
zooko: Re: "confusing crap" vs. "nice example"

Now you're making a good argument. (Unlike off-handed words like "confusing" and "crap".)

I don't yet see if it is a correct good argument, though. I don't see a situation in which the user wants a fuzzy match for the merge. Every edge in the graph represents a diff that a specific user approved. This conversation is not about the fuzziness inherent in the production of those diffs, right? (That is a different but related issue so it can confuse discussion.)

So with this graph:

    a
   / \
  b1  c1
  |
  b2

assuming that the user who generated the diff from a->c1 generated the diff they intended to, and the user who generated the diff from a->b1 did the same, and the user who generated the diff from b1->b2 did the same, then I don't think the user who asks for the merge of the two branches would ever want the fuzzy solution which ignores the a->b1 edge and the b1->b2 edge in favor of using just the a, c1, and b2 states.

Mon, Apr. 18th, 2011 08:21 pm (UTC)

On Reddit, a user quite reasonably asks:

> > Having multiple branches which are constantly pulling in each other's changes at a slight lag is bad development practice anyway
>
> Wait, ain't that the scenario for which DVCS are meant?

Are we misunderstanding what you meant?

Tue, Apr. 19th, 2011 12:07 pm (UTC)

You should generally have a master, and have branches pull from master frequently and synch to master occasionally, and when they do synch to master it should be off the most recent version.