You are viewing bramcohen

Sat, Apr. 23rd, 2005, 10:16 pm
Version Control Shenanigans

For those of you who have been following the recent Bitkeeper shenanigans, I'm now going to give the inside scoop on what's happening in free distributed version control.

There's one main interface to distributed version control, which is essentially how Darcs, Monotone, and Codeville all work. Arch and its kin all have a much klunkier UI which basically makes them non-contenders for projects like the Linux kernel. There are some other potentially competitive projects: Vesta, which is extremely mature and powerful but currently doesn't do merging; Bazaar-ng, which is currently mostly vaporware; and svk, which I think has recently switched from an arch-style to a darcs-style interface, but I can't say for sure. There are probably other systems worth at least a mention, but for now I'll just stick to discussing the (to my knowledge) most mature and promising systems.

A new system is Git (more of a version control back end than a version control system), which Linus Torvalds hacked together and initially said was a 'stop-gap' measure, but now appears to be getting quite excited about, and is thinking may be a good long-term solution. Git was originally supposed to be simple and fast, except new developments are heading it in the direction of being not so simple and not so fast, and its network protocol is currently a disaster. The good news is that Git is basically a ripoff of a bunch of the architectural ideas behind Monotone, which are good ideas, so it can't be as big of a disaster as, say, Subversion, but it's currently extraordinarily similar to a very old version of Monotone, which makes parallel development of Git and Monotone seem like a waste of resources. The initially stated reason for the new system was that Monotone was too slow, but just a little bit of optimization has made Monotone many times faster than it was before, and most of the remaining performance difference is caused by sanity checks, which can simply be turned off. There are perfectly reasonable long-term strategies which involve separate Git development for the time being, and I'll get to those later, but we in the distributed version control world really have no idea what Linus is thinking at this point.

Darcs, Monotone, and Codeville implement one each of the known ways to approach distributed merge - patch commutation, three-way merge, and two-way merge with history, respectively. Most other projects wind up using three-way merge, although it's the least powerful of the three. Darcs and Codeville were both motivated by the invention of their merge algorithms, and neither of the approaches has to my knowledge been invented independently.
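To make the three-way case concrete, here is a minimal sketch of the rule it applies. It works on whole files in a tree rather than on lines within a file, and the `Tree` type and function name are made up for illustration; none of this is taken from Monotone or any other system mentioned here. The point is only that the merge looks at three snapshots (the common ancestor and the two tips) and nothing else.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical representation: path -> file contents.
Tree = Dict[str, str]


def three_way_merge(base: Tree, ours: Tree, theirs: Tree) -> Tuple[Tree, Set[str]]:
    """Merge two trees against their common ancestor.

    Returns the merged tree and the set of paths that conflict.
    A missing path is treated as None, so deletions are handled
    by the same comparisons as edits.
    """
    merged: Tree = {}
    conflicts: Set[str] = set()
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b: Optional[str] = base.get(path)
        o: Optional[str] = ours.get(path)
        t: Optional[str] = theirs.get(path)
        if o == t:
            result = o            # both sides agree (or both deleted)
        elif o == b:
            result = t            # only their side changed
        elif t == b:
            result = o            # only our side changed
        else:
            conflicts.add(path)   # both changed, differently
            result = o            # keep ours as a placeholder
        if result is not None:
            merged[path] = result
    return merged, conflicts
```

For example, if one side edits `a` and the other edits `b` and deletes `c`, the result takes all three changes with no conflict. If both sides edit `a` differently, `a` lands in the conflict set.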

Now, for some comparisons -

Darcs is lacking in hash-based history, which means that past versions can't be reproduced and that an interloper could easily change what the history says. This is a major missing feature. There's currently talk of making Darcs use Git as a back-end, which would give it the hash-based history, but Git's whole-file view of the universe isn't a terribly good match for Darcs's patch-based one. I think a more Codeville-like history would be a much better fit, but doing inference of patches on the client side based on the history would be a workable solution. Darcs also suffers from some extremely bad asymptotic runtimes in the algorithms it uses, which turn up in not terribly uncommon cases. This is currently being worked on, but is an area of actual research rather than simple optimization. Because of these problems, Darcs isn't really ready for prime-time yet, but may be in the not too distant future. Darcs's big advantage right now is that it has very good, extensive support for cherry picking, a feature which is planned for Codeville but not implemented yet, and I'm not sure if it's planned for Monotone.
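Cherry picking in a patch-based system leans on reordering patches: to pull one change out from behind another, you commute the pair so the wanted patch comes first. The following is a toy sketch of commutation for the simplest possible patch type, a single-line insert. The `Insert` class and `commute` function are invented for this example and are not Darcs's actual data structures or patch theory. In a real system some pairs don't commute at all (for example, when the second patch edits a line the first one added).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    """A toy patch: insert one line of text at a given index."""
    index: int
    text: str

    def apply(self, lines: List[str]) -> List[str]:
        return lines[:self.index] + [self.text] + lines[self.index:]


def commute(first: Insert, second: Insert) -> Tuple[Insert, Insert]:
    """Given `first` followed by `second`, return (second', first').

    Applying second' and then first' gives the same document as
    applying first and then second; only the indices are adjusted.
    """
    if second.index <= first.index:
        # second lands at or before first's line, pushing it down by one.
        return Insert(second.index, second.text), Insert(first.index + 1, first.text)
    # second lands after first's line, so without first it sits one higher.
    return Insert(second.index - 1, second.text), Insert(first.index, first.text)
```

Once the pair is commuted, applying only the first element of the result is the cherry pick: the second change on its own, expressed against the document as it was before the first change.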

Monotone is a fairly mature, mostly traditional three-way-merge-based system with a hash-based history. It has a decent network protocol and rudimentary (but far from complete) support for renames. (Git doesn't have support for renames, a hard-to-change architectural decision which was made when it was supposed to be a quick-hack temporary solution.) Monotone's merge algorithm isn't as good as Codeville's, and there's been some talk of making Monotone use Codeville's merge algorithm, but that's an involved topic whose future no one is sure of. Monotone also supports some nice certification functionality, whose importance is unclear and which could be added to other systems. A hash-based history gets a lot of security to begin with, and the certs don't carry over between format changes, so they're causing a fair amount of possibly unnecessary pain for the time being.
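As a rough illustration of why a hash-based history gets a lot of security to begin with, here is a minimal sketch in which each revision's id is a hash over its parents' ids and its content. The `Revision` layout, the in-memory `store`, and the choice of SHA-1 are assumptions made for the example, not the actual format of Monotone, Codeville, or Git. Because every id covers the ids before it, changing any old revision changes every id downstream, so someone holding only the head id can detect tampering.

```python
import hashlib
from typing import Dict, List, NamedTuple


class Revision(NamedTuple):
    parents: List[str]   # ids of parent revisions
    content: bytes       # hypothetical: the revision's payload


def revision_id(rev: Revision) -> str:
    """Hash the parent ids together with the content."""
    h = hashlib.sha1()
    for parent in rev.parents:
        h.update(parent.encode("ascii"))
        h.update(b"\n")
    h.update(b"\n")
    h.update(rev.content)
    return h.hexdigest()


def commit(store: Dict[str, Revision], parents: List[str], content: bytes) -> str:
    """Store a new revision under its own hash and return that id."""
    rev = Revision(list(parents), content)
    rid = revision_id(rev)
    store[rid] = rev
    return rid


def verify(store: Dict[str, Revision], head: str) -> bool:
    """Walk back from `head`; every revision must hash to its own id."""
    pending = [head]
    seen = set()
    while pending:
        rid = pending.pop()
        if rid in seen:
            continue
        seen.add(rid)
        rev = store.get(rid)
        if rev is None or revision_id(rev) != rid:
            return False
        pending.extend(rev.parents)
    return True
```

Rewriting an old revision in the store without changing its id makes `verify` fail, and rewriting it with a new id orphans it from the head.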

Codeville is also fairly mature, and is having the last few rough edges polished up right now. It has a good network protocol (technically, it will in about two weeks), good (but not quite complete) support for renames, a well-done hash-based history, and the best merging of any available system. There's a subtle architectural distinction in the history approach of Monotone and Codeville - Monotone records the secure hashes of all old versions, while Codeville records the changes from the old hashed versions. Monotone's approach is less simple in the end. The problem is that for efficient transfer over the wire you need to pass deltas rather than full copies, so you need to cache the deltas, or generate them on the fly, and they need to be integrity checked as they come down, which means a whole lot of hash checking of intermediate versions. The on-paper advantage of storing complete copies is that if you cache complete copies on disk you don't need to run regeneration code when making a new checkout, but that proves to be invalid in practice because the operation's performance is completely dominated by the number of hard drive seeks it has to do, which leads to some weird results like a Codeville checkout being faster than a cp -a. Optimizing for the number of seeks is an interesting subject which is beyond the scope of this entry.
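The cost of checking deltas as they come down can be sketched roughly as follows. This is an illustration under made-up assumptions, not either system's wire format: a `Delta` here is a list of line-range replacements plus the SHA-1 of the version it should produce, and deltas are generated with Python's `difflib`. The receiver has to rebuild and hash every intermediate version on the way to the one it actually wants.

```python
import hashlib
from difflib import SequenceMatcher
from typing import List, NamedTuple, Tuple

# Hypothetical edit: replace old[start:end] with the given lines.
Edit = Tuple[int, int, List[str]]


class Delta(NamedTuple):
    edits: List[Edit]
    result_hash: str     # hash of the version this delta produces


def digest(lines: List[str]) -> str:
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def make_delta(old: List[str], new: List[str]) -> Delta:
    """Record only the changed ranges, plus the hash of the result."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    edits = [(i1, i2, new[j1:j2])
             for tag, i1, i2, j1, j2 in matcher.get_opcodes()
             if tag != "equal"]
    return Delta(edits, digest(new))


def apply_delta(old: List[str], delta: Delta) -> List[str]:
    """Rebuild the next version and check it against the expected hash."""
    out: List[str] = []
    pos = 0
    for start, end, replacement in delta.edits:
        out.extend(old[pos:start])   # unchanged lines before the edit
        out.extend(replacement)
        pos = end
    out.extend(old[pos:])            # unchanged tail
    if digest(out) != delta.result_hash:
        raise ValueError("integrity check failed")
    return out


def replay(base: List[str], deltas: List[Delta]) -> List[str]:
    """Apply a chain of deltas, hashing every intermediate version."""
    current = base
    for delta in deltas:
        current = apply_delta(current, delta)
    return current
```

Note that `replay` does one full hash per delta in the chain, which is the "whole lot of hash checking of intermediate versions" described above.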

As you've probably gathered, Git's quick hack nature is readily apparent, even ignoring that it hasn't even gotten started on implementing merge yet. A hopeful sign is the development of Cogito, which is a front end to Git with a reasonable interface. If everyone starts using Cogito, then it would be a simple matter to make a Codeville- or Monotone-based back end which was command identical to Git, but also had renames, and a non-sucky network protocol, and decent merging. That depends, of course, on people actually using Cogito as their standard Git interface, which is mostly dependent on what Linus wants, and like I said, we don't currently have any idea what Linus is thinking.

Monotone and Codeville have been growing closer over time. Whether they'll reach a complete unification at some point is an involved topic which hasn't been fully explored.

You can read much of the discussion which has been happening around version control systems at loglibrary. #codeville isn't logged yet though.

The old Linux kernel history is now readily available from SourcePuller, which was written by Tridge. It was the writing of this (extremely simple) script which caused the BitKeeper license to get yanked. Ironically, SourcePuller has helped with damage control from the BitKeeper license yanking by making the full old history (not just the linearized version available from CVS) available. Also ironically, Git is currently using rsync for its network protocol, and rsync was written by Tridge. Rsync is actually quite dated, and the wrong tool for that particular job in any case, but that's a whole other subject.

Full disclosure, of course, is that I'm the founder of Codeville and a current contributor (although most of the work these days is done by Ross). While this makes me a bit biased, it's also resulted in me having a much better sense of what's going on.

By coincidence, the Monotone and Codeville web sites are hosted on the same server.

I could go more into some of the personalities involved and implementation details of some of the systems (like what languages they use) but I've already spent way more time on this than intended, so that's all for now.

Sun, Apr. 24th, 2005 09:41 pm (UTC)
bramcohen: Re: Some context...

I have looked at BitKeeper, at least to the extent that I'm allowed to, and it appears to have very much the same interface that Codeville/Darcs/Monotone do. I'm not sure what features Linus really wanted to begin with (there was much talk of involved cherry-picking things which Linus wanted) but Linus has recently said that BitKeeper made him change his process - basically by changing it into the simple 'pull everything' process.

"As for the growing complication and declining performance of git, this appears to be a fiction created by the author"

You are simply wrong. Git is a weekend hack which looks like a weekend hack, and everything else started out looking like a weekend hack as well.

Monotone got a lot faster in a weekend simply because no one had bothered to try to optimize it for projects the size of the kernel before that. Nothing mysterious there.

My big beef with Linus right now is that he's applying a ridiculous double standard to all the other projects and his hack project. Git has only one benchmark which it does well at, and ranges between awful and unimplemented for a whole bunch of others, and yet he's proclaiming it usable because it can do a checkout in a few seconds on his own machine, which happens to have gigabytes of memory and a hot cache.

"As for git, either hop on board or compete head to head. Either way, 'shut up and code'."

You seem to have missed that I'm doing exactly that.

Mon, Apr. 25th, 2005 05:14 am (UTC)
(Anonymous): Re: Some context...

You wanted him to say that distributed SCMs were acceptable three years ago? You know exactly what Linus wants because it is what he coded up, after using what was out there at the time. And to say that renames not being supported because it's a quick hack and not because it was a design decision is very misleading. Linus has made his point very clear about where renames belong.

I think git has been a good thing. For us normal ppl: we get git. For kernel developers: they get git. For distributed SCM developers: the field is still open/no "linus uses us" winners yet; and for everyone: the distributed SCM folks get a kick in the butt.

Mon, Apr. 25th, 2005 05:47 am (UTC)
bramcohen: Re: Some context...

"You wanted him to say that distributed SCMs were acceptable three years ago?"

I honestly can't fathom what in my comment could give you that impression. Git is being compared with the systems out there today, not the systems out there three years ago.

"And to say that renames not being supported because it's a quick hack and not because it was a design decision is very misleading. Linus has made his point very clear about where renames belong."

I saw a post where Linus said, in so many words, that it would be silly to add renames to such a simplistic system as Git. If you've got a link where he said something else I'd honestly like to see it.

"For us normal ppl: we get git. For kernel developers: they get git."

Once Git gets a real network protocol and a merge algorithm, you won't any more. Your (rather ridiculous) assumption that Git is anywhere near the same level of usability as the other systems is totally off-base.

Mon, Apr. 25th, 2005 10:05 am (UTC)
(Anonymous): Re: Some context...

Renames: If you're talking about this post:
http://www.gelato.unsw.edu.au/archives/git/0504/0155.html
I read it differently to you. I think Linus is not saying that renames shouldn't be added, but that they shouldn't be added in a way which causes a layer violation.

However, I think Linus may have made a mistake in reviewing all the other free scm systems while he was still pissed off.

Mon, Apr. 25th, 2005 03:09 pm (UTC)
(Anonymous): Re: Some context...

This is why I thought you were talking about the original choice of BK:
"I'm not sure what features Linus really wanted to begin with"

Well, I like this as his statement why renames are not in git: http://www.gelato.unsw.edu.au/archives/git/0504/0248.html

I don't think git has the capabilities that other SCMs have, but I will state that it's near the functionality that some people need (read: minimally Linus), and has a good userbase. As for me, I was thinking git would be an interesting foundation for a type of googlefs (distributed/redundant/versioned filesystem), not an SCM.

Thu, Apr. 28th, 2005 10:01 pm (UTC)
pavelmachek: Re: Some context...

Is there an easy way to get the Linux kernel in darcs/monotone/codeville format? It would be nice to play with it that way.

git/cogito is starting to be usable thanks to Linus providing changes that way... It is better than patch & merge at least.

Tue, Apr. 26th, 2005 06:22 pm (UTC)
(Anonymous): Re: Some context...

I don't think Linus cares about how it does with other benchmarks, as the only thing he wants is something fast for the task he spends the whole day doing. GIT is super-optimized for his "hot path", you can't blame him for liking GIT.

From the messages I've read, he doesn't even care about it being a real SCM or not. GIT is not a real SCM, nor was it designed to be one, but it does one thing well (supporting renames in GIT would have been a _BIG_ and ugly hack), and IMO Linus' mind is too "unixy" to change his GIT tool for something slower which does tons of things he doesn't give a shit about.

Tue, Apr. 26th, 2005 06:27 pm (UTC)
bramcohen: Re: Some context...

Perhaps I haven't made this point clear enough, but the benchmark which Linus is excited about is something which almost anything can be made to do just as well.