You are viewing bramcohen

Sun, Apr. 17th, 2011, 06:56 pm
Git Can't Be Made Consistent

This post complains about Git lacking eventual consistency. I have a little secret for you: Git can't be made to have eventual consistency. Everybody seems to think the problem is a technical one, of complexity vs. simplicity of implementation. They're wrong. The problem is semantics. Git follows the semantics which you want 99% of the time, at the cost of having some edge cases which it's inherently just plain broken on.

When you make a change in Git (and Mercurial) you're essentially making the following statement:

This is the way things are now. Forget whatever happened in the past, this is what matters.


Which is subtly and importantly different from what a lot of people assume it should be:

Add this patch to the corpus of all changes which have ever been made, and are what defines the most recent version.


The example linked above has a lot of extraneous confusing stuff in it. Here's an example which cuts through all the crap:

  A
 / \
B   B
|
A


In this example, one person changed a files contents from A to B, then back to A, while someone else changed A to B and left it that way. The question is: What to do when the two heads are merged together? The answer deeply depends on the assumed semantics of what the person meant when they reverted back to A. Either they meant 'oops I shouldn't have committed this code to this branch' or they meant 'this was a bad change, delete it forever'. In practice people mean the former the vast majority of the time, and its later effects are much more intuitive and predictable. In fact it's generally a good idea to make the separate branch with the change to B at the same time as the reversion to A is done, so further development can be done on that branch before being merged back in later. So the preferred answer is that it should clean merge to B, the way 3 way merge does it.

Unfortunately, this decision comes at significant cost. The biggest problem is that it inherently gives up on implicit cherry-picking. I came up with some magic merge code which allowed you to cut and paste small sections of code between branches, and the underlying version control system would simply figure out what you were up to and make it all work, but nobody seemed much interested in that functionality, and it unambiguously forced the merge result in this case to be A.

A smaller problem, but one which seems to perturb people more, is that there are some massively busted edge cases. The worst one is this:

  A
 / \
B   B
|   |
A   A


Obviously in this case both sides should clean merge to A, but what if people merge like this?

  A
 / \
B   B
|\ /|
A X A
|/ \|


Because of the cases we just went over, they should clean merge to B. What if they are then merged with each other? Since both sides are the same, there's only one thing they can merge to: B

  A
 / \
B   B
|\ /|
A X A
|/ \|
B   B
 \ /
  B


Hey, where'd the A go? Everybody reverted their changes from B back to A, and then via the dark magic of merging the B came back out of the ether, and no amount of further merging will get rid of it again!

The solution to this problem in practice is Don't Do That. Having multiple branches which are constantly pulling in each others's changes at a slight lag is bad development practice anyway, so people treat their version control system nicely and cross their fingers that the semantic tradeoff they made doesn't ever cause problems.

Mon, Apr. 18th, 2011 06:38 pm (UTC)
faceted_jacinth

> In this example, one person changed a files contents from A to B, then back to A, while someone else changed A to B and left it that way. The question is: What to do when the two heads are merged together?

Report a merge conflict.

Going back to the basics: what are the semantics of a 3-way merge? Well, we have three snapshots, the base and two branches. We make diffs between the base and the branches. What it actually means is that we reconstruct the programmers' actions from the snapshots: they added these lines, modified these lines, deleted these lines.

Then we either merge the diffs and produce the merged snapshot, or discover a conflict: both programmers modified the same line, or one deleted the line another modified, or both added some lines in the same place.

And if we look at this this way, then the root of the problem is that git (as well as Mercurial, SVN, etc) take a shortcut when computing the diffs to be fed into the 3-way merge, and that sometimes produces incorrect/inconsistent results.

An example of git doing it wrong: http://pastebin.com/SxmwpFkY

If I run the the script, switch to master and do "git diff master~2", git produces an incorrect diff:
@@ -3,3 +3,11 @@ B
 C
 D
 E
+G
+G
+G
+A
+B
+C
+D
+E
I mean, it's a correct diff between the two snapshots, but this is not what I did.

But when I run "git blame", it produces the correct "diff":
6584854c 1) A
6584854c 2) B
6584854c 3) C
6584854c 4) D
6584854c 5) E
c36e1ff8 6) G
c36e1ff8 7) G
c36e1ff8 8) G
^b43cba4 9) A
^b43cba4 10) B
^b43cba4 11) C
^b43cba4 12) D
^b43cba4 13) E

Here it examines all commits leading to the current one, and deduces the position of the lines from the initial commit (^b43cba4) correctly.

And that's it. Normal diffs are not associative, given successive snapshots a, b, c, diff(diff(a, b), c) != diff(a, diff(b, c)) != diff(a, c). While the output of `blame` is associative (except maybe for deleted lines).

So it seems that if git (and hg, and svn) used the output of a blame-like algorithm for merging, then the order of automatic merges wouldn't matter. And the problem becomes purely technical one: how to make this blame-like algorithm fast enough.

(by the way, of course the basic step -- that reconstruction of programmer's actions from the difference between the snapshot, is not infallible. But when you use it on two adjacent snapshots it's good enough, also pretty transparent. As the distance between the snapshots being diffed (by the number of intermediate commits) increases, the probability of getting it wrong grows).

Mon, Apr. 18th, 2011 07:15 pm (UTC)
bramcohen

Showing a conflict in that example would be clearly broken behavior. It could result in repeatedly showing the exact same merge conflict over and over again, between the exact same values, on later merges.

Mon, Apr. 18th, 2011 07:46 pm (UTC)
faceted_jacinth

> It could result in repeatedly showing the exact same merge conflict over and over again, between the exact same values, on later merges.

Why? Any merge establishes definitive ancestries for each line of code, and when you are merging "A - B - A" with something, you are supposed to tell it that the conflicting lines come from the base snapshot, not from your "reversal". In fact, when you want to revert a commit, instead of re-committing the previous version you should merge with it, I think.

Was that the problem that you were thinking about?

Thu, Apr. 21st, 2011 06:11 pm (UTC)
bramcohen

Merging with an old version to undo changes just plain doesn't make sense - the whole point of history is that it knows that later versions supercede older versions.

Thu, Apr. 21st, 2011 09:34 pm (UTC)
faceted_jacinth

Also, to clarify: here a merge would be quite different from an ordinary commit but with two parents, here it would have to store the choices the user made, from which of the parents he picked the lines. Plus maybe some additional modifications he had to make, as a separate commit.