
Sat, Apr. 23rd, 2005, 10:16 pm
Version Control Shenanigans

For those of you who have been following the recent Bitkeeper shenanigans, I'm now going to give the inside scoop on what's happening in free distributed version control.

There's one main interface to distributed version control, which is essentially how Darcs, Monotone, and Codeville all work. Arch and its kin all have a much clunkier UI which basically makes them non-contenders for projects like the Linux kernel. There are some other potentially competitive projects, such as Vesta, which is extremely mature and powerful but currently doesn't do merging, Bazaar-ng, which is currently mostly vaporware, and svk, which I think has recently switched from an arch-style to a darcs-style interface, but I can't say for sure. There are probably other systems worth at least a mention, but for now I'll just stick to discussing the (to my knowledge) most mature and promising systems.

A new system is Git (more of a version control back end than a version control system), which Linus Torvalds hacked together and initially said was a 'stop-gap' measure, but now appears to be getting quite excited about, and is thinking may be a good long-term solution. Git was originally supposed to be simple and fast, except new developments are heading it in the direction of being not so simple and not so fast, and its network protocol is currently a disaster. The good news is that Git is basically a ripoff of a bunch of the architectural ideas behind Monotone, which are good ideas, so it can't be as big of a disaster as, say, Subversion, but it's currently extraordinarily similar to a very old version of Monotone, which makes parallel development of Git and Monotone seem like a waste of resources. The initially stated reason for the new system was that Monotone was too slow, but just a little bit of optimization has made Monotone many times faster than it was before, and most of the remaining performance difference is caused by sanity checks, which can simply be turned off. There are perfectly reasonable long-term strategies which involve separate Git development for the time being, and I'll get to those later, but we in the distributed version control world really have no idea what Linus is thinking at this point.

Darcs, Monotone, and Codeville implement one each of the known ways to approach distributed merge - patch commutation, three-way merge, and two-way merge with history, respectively. Most other projects wind up using three-way merge, although it's the least powerful of the three. Darcs and Codeville were both motivated by the invention of their merge algorithms, and neither of the approaches has to my knowledge been invented independently.
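To make the three-way merge idea concrete, here is a minimal sketch at whole-file granularity. It is only an illustration under stated assumptions, not how Monotone or any other system actually implements it: the function name `three_way_merge` and the dict-of-paths representation are made up for the example, and real tools also merge within files. The point it shows is that the merge only ever looks at three snapshots (one common ancestor and the two sides) and ignores the rest of the history.

```python
def three_way_merge(base, ours, theirs):
    """Merge two descendants of `base`, one path at a time.

    Each argument maps a path to that file's content; a missing path
    means the file does not exist in that snapshot.
    Returns (merged_tree, conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or neither touched it)
            result = o
        elif o == b:      # only their side changed this path
            result = t
        elif t == b:      # only our side changed this path
            result = o
        else:             # both sides changed it in different ways
            conflicts.append(path)
            continue
        if result is not None:   # None means the file was deleted
            merged[path] = result
    return merged, conflicts
```

If one side edits a file and the other deletes a different file, both changes come through cleanly; if both sides edit the same file differently, the path is reported as a conflict rather than guessed at.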

Now, for some comparisons -

Darcs is lacking in hash-based history, which means that past versions can't be reproduced and that an interloper could easily change what the history says. This is a major missing feature. There's currently talk of making Darcs use Git as a back-end, which would give it the hash-based history, but Git's whole-file view of the universe isn't a terribly good match for Darcs's patch-based one. I think a more Codeville-like history would be a much better fit, but doing inference of patches on the client side based on the history would be a workable solution. Darcs also suffers from some extremely bad asymptotic runtimes in the algorithms it uses, which turn up in not terribly uncommon cases. This is currently being worked on, but is an area of actual research rather than simple optimization. Because of these problems, Darcs isn't really ready for prime-time yet, but may be in the not too distant future. Darcs's big advantage right now is that it has very good, extensive support for cherry picking, a feature which is planned for Codeville but not implemented yet; I'm not sure whether it's planned for Monotone.
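For readers unfamiliar with what a hash-based history buys you, here is a rough sketch. It assumes a simple linear history and a made-up record layout (`revision_id`, `build_history`, and `verify_history` are hypothetical names, not taken from any of the systems discussed). Each revision's id is a SHA-1 over its parent's id plus its content, so knowing the latest id is enough to detect any change to earlier history.

```python
import hashlib

def revision_id(parent_id, content):
    """Id of a revision: SHA-1 over the parent's id and this revision's bytes."""
    h = hashlib.sha1()
    h.update(parent_id.encode("ascii"))
    h.update(b"\0")
    h.update(content)
    return h.hexdigest()

def build_history(contents):
    """Chain a list of byte strings into (id, parent_id, content) records."""
    history, parent = [], ""
    for content in contents:
        rid = revision_id(parent, content)
        history.append((rid, parent, content))
        parent = rid
    return history

def verify_history(history, head_id):
    """Recompute every id from the root and compare against the trusted head."""
    parent = ""
    for rid, recorded_parent, content in history:
        if recorded_parent != parent or revision_id(parent, content) != rid:
            return False
        parent = rid
    return parent == head_id
```

Someone who rewrites an old revision either leaves a record whose id no longer matches its content, or has to recompute every later id, which changes the head id that everyone else already holds.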

Monotone is a fairly mature, mostly traditional three-way-merge-based system with a hash-based history. It has a decent network protocol and rudimentary (but far from complete) support for renames. (Git doesn't have support for renames, a hard-to-change architectural decision which was made when it was supposed to be a quick hack temporary solution.) Monotone's merge algorithm isn't as good as Codeville's, and there's been some talk of making Monotone use Codeville's merge algorithm, but that's an involved topic and no one is sure where it will go. Monotone also supports some nice certification functionality, whose importance is unclear and which could be added to other systems. A hash-based history gets a lot of security to begin with, and the certs don't carry over between format changes, so they're causing a fair amount of possibly unnecessary pain for the time being.

Codeville is also fairly mature, and is having the last few rough edges polished up right now. It has a good network protocol (technically, it will in about two weeks), good (but not quite complete) support for renames, a well-done hash-based history, and the best merging of any available system. There's a subtle architectural distinction in the history approach of Monotone and Codeville: Monotone records the secure hashes of all old versions, while Codeville records the changes from the old hashed versions. Monotone's approach is less simple in the end. The problem is that for efficient transfer over the wire you need to pass deltas rather than full copies, so you need to cache the deltas, or generate them on the fly, and they need to be integrity checked as they come down, which means a whole lot of hash checking of intermediate versions. The on-paper advantage of storing complete copies is that if you cache full copies on disk you don't need to run regeneration code when making a new checkout, but that proves to be invalid in practice because the operation's performance is completely dominated by the number of hard drive seeks it has to do, which leads to some weird results like a Codeville checkout being faster than a cp -a. Optimizing the number of seeks is an interesting subject which is beyond the scope of this entry.
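The integrity checking of deltas described above can be sketched roughly as follows. This is an illustration only, not Monotone's or Codeville's actual wire format: the copy/insert delta encoding and the names `make_delta`, `apply_delta`, and `digest` are assumptions made for the example. The receiver rebuilds the new version from a version it already has plus the delta, then checks the result against the hash the history says it should have.

```python
import difflib
import hashlib

def digest(lines):
    """SHA-1 of a version given as a list of text lines."""
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()

def make_delta(old, new):
    """Describe `new` as ranges copied from `old` plus literally inserted lines."""
    delta = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))
        # "delete" needs no operation: those old lines are simply not copied
    return delta

def apply_delta(old, delta, expected_digest):
    """Rebuild the new version and refuse it if its hash is not the expected one."""
    new = []
    for op in delta:
        if op[0] == "copy":
            new.extend(old[op[1]:op[2]])
        else:
            new.extend(op[1])
    if digest(new) != expected_digest:
        raise ValueError("delta does not produce the expected version")
    return new
```

Fetching a long chain of versions this way means one hash check per intermediate version, which is the cost the paragraph above is pointing at.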

As you've probably gathered, Git's quick hack nature is readily apparent, even ignoring that it hasn't even gotten started on implementing merge yet. A hopeful sign is the development of Cogito, which is a front end to Git with a reasonable interface. If everyone starts using Cogito, then it would be a simple matter to make a Codeville- or Monotone-based back end which was command identical to Git, but also had renames, a non-sucky network protocol, and decent merging. That depends, of course, on people actually using Cogito as their standard Git interface, which is mostly dependent on what Linus wants, and like I said, we don't currently have any idea what Linus is thinking.

Monotone and Codeville have been growing closer over time. Whether they'll reach a complete unification at some point is an involved topic which hasn't been fully explored.

You can read much of the discussion which has been happening around version control systems at loglibrary. #codeville isn't logged yet though.

The old Linux kernel history is now readily available from SourcePuller, which was written by Tridge. It was the writing of this (extremely simple) script which caused the BitKeeper license to get yanked. Ironically, SourcePuller has helped with damage control from the BitKeeper license yanking by making the full old history (not just the linearized version available from CVS) be available. Also ironically, Git is currently using rsync for its network protocol, and rsync was written by Tridge. Rsync is actually quite dated, and the wrong tool for that particular job in any case, but that's a whole other subject.

Full disclosure, of course, is that I'm the founder of Codeville and a current contributor (although most of the work these days is done by Ross). While this makes me a bit biased, it's also resulted in me having a much better sense of what's going on.

By coincidence, the Monotone and Codeville web sites are hosted on the same server.

I could go more into some of the personalities involved and implementation details of some of the systems (like what languages they use) but I've already spent way more time on this than intended, so that's all for now.

Sun, Apr. 24th, 2005 06:36 pm (UTC)
weeg: Rsync?

I'd love to hear why you think rsync is dated; I can see how it would be a poor choice as a protocol for this purpose, but in its niche it seems reasonable.

The only other tool I know of is unison, and it has its issues as well.

Sun, Apr. 24th, 2005 09:46 pm (UTC)
[info]bramcohen: Re: Rsync?

The technical issues are somewhat involved, so I'll have to post about it in another entry, but the basic idea is that it has fixed fairly large network overhead, even when no data has been changed at all.

Sun, May. 1st, 2005 02:53 am (UTC)
(Anonymous): Re: Rsync?

The latest versions of rsync have special handling for various
kinds of archives. For example, I was surprised to see a large
zip file transferred very quickly, until I realized that it
updated the existing zip file rather than sending the whole thing.

It should be easy to define a more efficient protocol for a git
repository.

For example, the objects in a git repository are spread across 256
subdirectories; if the mtime on one of those subdirectories is the
same locally and remotely, you don't even need to list the
contents of that subdirectory.

If the subdirectory changed, then you only need to list its
contents (a very fast sequential read of the directory); you do
*not* need to stat each of the files (which requires a lot of
seeks, unless you have a hot cache) to get the mtime, because the
file name uniquely determines the file contents.

Now, the file names themselves can amount to a lot of network
traffic (although you can halve it by converting the hex SHA-1
name to binary). But, you only have to transfer the names if a
subdirectory has a new file. And, even this can be sped up with
some cleverness. For example, you could group the filenames into
16 different buckets (using 2 hex digits from the SHA-1 in each
file name), and then compute the SHA-1 of the sorted filenames in
each bucket, and send those sixteen values; after that, you only
need to send the names for the buckets that changed.

-scott
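A rough sketch of the bucket idea from the comment above, under a few assumptions: object file names are lowercase hex strings, the bucket is chosen by the first hex digit of the name (a single digit is what yields sixteen buckets), and the function names `bucket_digests` and `buckets_to_fetch` are made up for illustration. Nothing here is an existing git or rsync protocol.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"

def bucket_digests(names):
    """Group names by their first hex digit and hash each sorted group."""
    buckets = {digit: [] for digit in HEX_DIGITS}
    for name in names:
        buckets[name[0]].append(name)
    return {
        digit: hashlib.sha1("\n".join(sorted(members)).encode("ascii")).hexdigest()
        for digit, members in buckets.items()
    }

def buckets_to_fetch(local_names, remote_digests):
    """Buckets whose name lists differ, i.e. the only ones worth listing in full."""
    local = bucket_digests(local_names)
    return sorted(d for d in HEX_DIGITS if local[d] != remote_digests[d])
```

Each side sends sixteen digests per subdirectory; full name lists are exchanged only for the buckets returned by `buckets_to_fetch`.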

Sun, Apr. 24th, 2005 06:42 pm (UTC)
[info]taral

Darcs has "contexts" which can be used to reproduce arbitrary versions.

Sat, May. 7th, 2005 06:26 am (UTC)
[info]bramcohen

My impression is that those are currently a form of tagging which isn't universally turned on by default.

Sun, Apr. 24th, 2005 08:58 pm (UTC)
(Anonymous): Some context...

I really don't intend to fan any flames here, overall the article is interesting and informative; however, I think some context is in order.

It really is unfair to paint Linus as the mysterious evil overlord who hates free SCMs.

Over the years, Linus has made it very clear what he wants from an SCM. He has frequently and verbosely pontificated on the subject.

So much so that once upon a time, Larry sat him down with a few beers to get him to elaborate. Larry, with capitalism and altruism hand in hand, went off to his dungeon to give it a shot. When he returned, Linus repeatedly rejected it, each time sending Larry back off to the dungeon for more work. In this time, BK (and its license) were good enough for a number of free software projects. Once Linus was finally satisfied, only then did the flamefest begin...

Throughout this time, everyone knew Linus wasn't happy with the world of SCM. Larry took initiative. I don't know Linus personally, but I don't think he is the kind of guy who would've turned down a few more beers to talk about SCM with someone else along the way...

If you want to know what Linus wants in a SCM, look no further than BK. I don't mean to copy or reverse-engineer it; but investigate it and learn what working paradigm it enables. Then, find another solution to that paradigm. Linus has repeatedly described his utopia and discussed where the attempts have gone wrong. People don't listen...

When all the Tridge/Larry drama went down and Linus realized he would have to find another solution, he and other kernel developers took a fresh look at what free SCM had to offer. Linus went further than looking... Look at the Monotone changelog describing those performance improvements. Credit goes to Linus himself. Performance was improved as he once again reached out with suggestions and even code.

Rather than accept something technically inferior or whine about it indefinitely, Linus practiced what he preaches: "Shut up and code!". Monotone got faster and then git was born. Like all good hackers, Linus is inherently lazy; if another SCM would have been good enough, he wouldn't have written git for fun.

As for the growing complication and declining performance of git, this appears to be a fiction created by the author. Yes git has changed, and yes it has grown, but the original data structures and operating philosophy have remained essentially unchanged while performance appears to be improving all the time.

What has happened with git is truly remarkable and may never happen again. The most prominent and sexy FOSS project was faced with a fundamental emergency. The most famous and worshiped FOSS developer drops everything to 'shut up and code'. A spontaneous eruption occurs as a community is formed and SCM is taken from the back room to center stage.

A certain amount of resentment is natural, but try to keep the big picture in mind. As for git, either hop on board or compete head to head. Either way, 'shut up and code'.

Rob

Sun, Apr. 24th, 2005 09:41 pm (UTC)
[info]bramcohen: Re: Some context...

I have looked at BitKeeper, at least to the extent that I'm allowed to, and it appears to have very much the same interface that Codeville/Darcs/Monotone do. I'm not sure what features Linus really wanted to begin with (there was much talk of involved cherry-picking things which Linus wanted) but Linus has recently said that BitKeeper made him change his process - basically by changing it into the simple 'pull everything' process.

"As for the growing complication and declining performance of git, this appears to be a fiction created by the author"

You are simply wrong. Git is a weekend hack which looks like a weekend hack, and everything else started out looking like a weekend hack as well.

Monotone got a lot faster in a weekend simply because no one had bothered to try to optimize it for projects the size of the kernel before that. Nothing mysterious there.

My big beef with Linus right now is that he's applying a ridiculous double-standard to all the other projects and his hack project. Git has only one benchmark which it does well at, and ranges between awful and unimplemented for a whole bunch of others, and yet he's proclaiming it usable because it can do a checkout in a few seconds on his own machine, which happens to have gigabytes of memory and a hot cache.

"As for git, either hop on board or compete head to head. Either way, 'shut up and code'."

You seem to have missed that I'm doing exactly that.

Sun, Apr. 24th, 2005 11:23 pm (UTC)
(Anonymous): Re: Some context...

As author of that Monotone changelog, I should probably provide a fact or two here -- Linus hasn't actually contributed any code to Monotone, or, to the best of my knowledge, any SCM besides git. He didn't really provide any suggestions either, beyond "this is too slow!" ;-). He's credited there because it was in discussions with him that I found the right test case to track down one of our major performance bugs. I debated for a bit whether I should actually credit him by name for that, exactly because it was likely to give people strange ideas, but, figured, if it had been anyone else I would have, so... *shrug*. FYI.

Overall, I find your post to have a somewhat odd slant to it. Yeah, Larry tackled the problem of SCM earlier than the serious free software efforts. One can come up with various reasons for that, but all I know is that I was doing other things back then, can't speak for any other free SCM authors :-). I won't comment on Larry's motivations wrt BK and the kernel; plenty of people have hashed out different sides to that, and I don't have any special insight. But this version of history where "a spontaneous eruption occurs [and] a community is formed" around SCM because of the BK mess is very odd. There've been a lot of people working on SCM for a while, now, and the systems have been getting quite good. I would have given it, mm, maybe a year, before BK started getting trounced by free competitors. This recent mess has created a lot of publicity for that community, and a lot of increased development effort speeding things up, but it hasn't really changed the face of things much. (Except by the addition of 'git' to the mix, but it isn't clear yet whether git actually brings anything new to the table; time will tell.)

Anyway, back to coding.

-- Nathaniel Smith (njs@pobox.com)

Mon, Apr. 25th, 2005 03:48 am (UTC)
[info]unixronin

Do you happen to have a pointer to the svn diatribe Pete Zaitcev mentioned? If there's valid issues with svn, I'd like to see it, as it may have relevance to a job I'm waiting to hear back on.

Mon, Apr. 25th, 2005 03:49 am (UTC)
[info]unixronin

guh. Bad fingers, no donut. CRITIQUE was the word I wanted.

It's late.

Mon, Apr. 25th, 2005 04:00 am (UTC)
[info]bramcohen

I'm not sure what reference you're talking about, but it's probably to diagnosing svn.

Subversion is probably a bit of an improvement over cvs, but it's fundamentally a centralized system and staying that way, which is why I look down on it with such disdain.

Mon, Feb. 13th, 2006 01:40 pm (UTC)
[info]zaitcev

I am not sure precisely what we are talking about here, but I suppose that I referred to the "Great Programmers" piece, which went like this:

The simplest architectural problems to solve are the ones which for lack of a better theory most people ascribe to emotional or psychological problems. These are decisions for which there's no rational justification whatsoever. For example, writing a non-speed-critical program (which is most of them) in C or C++. A few years ago you could justify that because the other languages didn't have such extensive libraries, but today it's ludicrous. Another one is building one's protocol as a layer on top of webdav. And another one is building a transactional system for retrieving any subsection of any point in the history of an arbitrarily large file in constant time when that isn't part of project requirements. Yes, I'm making fun of subversion here. It's a great example of a project permanently crippled by dumb architectural decisions.

Tue, Apr. 26th, 2005 02:24 pm (UTC)
[info]zojas: arch

unfortunately, gnu arch (tla) is the only advanced system I know how to use well. can you go into a little more detail; what about its interface is so clunky? my shell is set up to tab-complete the long names for me, so please complain about more than that. :) any info would be interesting. thanks. it seems to me that it certainly has history recording, and also merging covered pretty well.

Tue, Apr. 26th, 2005 05:02 pm (UTC)
[info]bramcohen: Re: arch

You should try using one of the other systems, that will give you a much better sense for the UI differences than any explanation I might give you.

Fri, Apr. 29th, 2005 04:29 pm (UTC)
[info]ciphergoth

I'd be very interested to hear you compare the merits of the three merge algorithms you mention in more detail. I have to confess to being very inclined towards monotone, not least because it seems to be the one furthest from the "all the world's a text file" assumption, but obviously one wants the best possible merges in such a project.

Sat, May. 7th, 2005 06:45 am (UTC)
[info]bramcohen

Monotone's is actually the crudest of them all, see Nathaniel's three way merge considered harmful post. The new Codeville merge algorithm seems to combine the best aspects of all three approaches, although the first version won't have cherry picking (a later version probably will).

Tue, May. 3rd, 2005 11:40 pm (UTC)
[info]orib: Issue Tracking, and more?

Two issues I have with version control systems, including the latest-and-greatest distributed systems:

1. Issue tracking. Effective issue tracking must be aware of branch ancestry in some way or another - e.g., if I release xyz-3.4.0, and branch what would later become 3.4.1 and 3.5.0 out of it, I want a bug reported against 3.4.0 to be listed against those newer branches. Furthermore, I would also want it listed against xyz-3.3.x and earlier, possibly back to xyz-0.0.1, if it originated there (something I don't expect whoever put the issue in the issuetracking database to be able to tell).

The problem is obviously not well defined, but I suspect some help from the version control system is required for any solution. I think it makes sense that if you sync other people's code, you should also sync their bug database (and how to properly merge that is yet another all new problem).

2. Multiple file ops. I tend to work bottom up, starting with large-scope, small-linecount files, and breaking them up to smaller files as all the details materialize and the linecount rises. I may split files more often than others, but I suspect splitting & merging of source files is not significantly less common than renames. Yet, no version control system to date considers this important enough to track.

Monotone allows one to manually proclaim a merge by listing all merged entities as parents, and to proclaim a split by listing the original file as a parent of each part. This would allow an annotate/blame feature to be actually useful through a split/merge. Monotone doesn't do this by itself, though.

At the expense of diskspace and a few heuristics, it's possible to provide this functionality - if you can efficiently search parts of the repository (e.g., keep a suffix tree of the entire repository contents), you can say that "X is Y's parent if 90% of Y comes from X" (catches splits, renames, copy-and-modify), and that "X is merged Y, Z, W" if Y,Z,W are ok with X as parent and 90% of X is covered by Y,Z and W.

Ill defined again, but I believe a set of few simple heuristics can properly track changes and ancestry without requiring the user to explicitly announce renames, copies, links, moves etc.
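The "90% of Y comes from X" heuristic from the comment above could be sketched like this. It is a toy version under loud assumptions: it compares files as sets of exact lines rather than using a suffix tree over the repository, and the names `coverage`, `is_parent`, and `is_merge_of` are invented for the example.

```python
def coverage(child_lines, parent_lines):
    """Fraction of the child's non-blank lines that also occur in the parent."""
    child = [line for line in child_lines if line.strip()]
    if not child:
        return 0.0
    pool = set(parent_lines)
    return sum(line in pool for line in child) / len(child)

def is_parent(parent_lines, child_lines, threshold=0.9):
    """'X is Y's parent if 90% of Y comes from X' (splits, renames, copies)."""
    return coverage(child_lines, parent_lines) >= threshold

def is_merge_of(merged_lines, sources, threshold=0.9):
    """Each source is mostly contained in the merged file, and the merged
    file is mostly covered by the sources taken together."""
    union = [line for source in sources for line in source]
    return (
        all(coverage(source, merged_lines) >= threshold for source in sources)
        and coverage(merged_lines, union) >= threshold
    )
```

Exact line matching will miss lines that were edited during the move, so a real implementation would need something fuzzier; the sketch only shows where the thresholds sit.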

Any thoughts on the subject?

Wed, May. 4th, 2005 05:15 pm (UTC)
[info]ciphergoth: Re: Issue Tracking, and more?

As regards (1), you wouldn't expect the VC system to store this information, so what you're asking for is for the VC system to export enough information for the issue tracker to handle the problem. For VC, then, the problem is just "make metadata available" and everything else is punted over the fence.
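As a sketch of what "make metadata available" might enable: if the version control system exports the revision graph, an issue tracker can work out which releases a bug applies to by itself. The graph format (a dict from revision to its child revisions), the version names, and the notion of a "fixed in" revision are all assumptions made for this illustration.

```python
def descendants(children, rev):
    """All revisions reachable from `rev` via child links, including `rev`."""
    seen, stack = set(), [rev]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen

def affected_revisions(children, introduced_in, fixed_in=()):
    """Everything descended from the revision that introduced the bug,
    minus everything descended from a revision that fixed it."""
    affected = descendants(children, introduced_in)
    for fix in fixed_in:
        affected -= descendants(children, fix)
    return affected
```

With a graph where xyz-3.4.0 branches into 3.4.1 and 3.5.0, a bug introduced in 3.3.0 is listed against all of them until a fix revision is recorded on a given branch. Working out which revision introduced the bug is the hard part, and this sketch does not attempt it.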

Sat, May. 7th, 2005 06:46 am (UTC)
[info]bramcohen: Re: Issue Tracking, and more?

Integrating issue tracking and the version control system is a good idea, although that should be done at a higher layer. The issue tracking system's database can be kept as a file or files in the codebase, and a tool could parse and modify that file as appropriate.

Sat, Jun. 18th, 2005 05:32 pm (UTC)
(Anonymous): merging comparison

is arch star merge the sort of 3-way merge nathaniel smith considers harmful? and what do you think about: http://subversion.tigris.org/variance-adjusted-patching.html from the disaster-svn?

Sat, Jun. 18th, 2005 05:33 pm (UTC)
(Anonymous): Re: merging comparison

star merge link:
http://wiki.gnuarch.org/Tla_20Reference_2fstar_2dmerge

Mon, Jun. 27th, 2005 04:43 pm (UTC)
(Anonymous): mercurial

What about mercurial?

Mon, Jul. 16th, 2007 11:35 pm (UTC)
[info]suppressingfire

Wincent Colaiuta's referencing this post and claiming that Linus was right: http://wincent.com/a/about/wincent/weblog/archives/2007/07/a_look_back_bra.php

Also, since it appears you've been working on Bazaar, how does it now fit into this categorization?

Tue, Jul. 17th, 2007 12:52 am (UTC)
[info]bramcohen

First of all, I'd like to say that this 'I'm smarter than you' bullshit is really sickening. I was distinctly on edge in that earlier discussion, mostly because Linus was being such a jerk. The blog post you link to is propagating the same crap, by framing all events since then in the context of picking out who it shows is smarter.

I haven't worked on bazaar - they just used some code I wrote. Some code which, I'd like to repeat, everybody else can and should as well, including git.

By any reasonable standard, both git and codeville are failures - git is hardly used outside of the linux kernel, and codeville is hardly used at all. It's a bit of an unfair comparison, in that git has had huge amounts of resources plowed into it, while codeville hardly any. I hadn't written any code on it for a while at the time that argument happened, and haven't since. I will readily admit that codeville had some serious issues (and probably still does) which make it inappropriate for significant use.

There were basically two arguments going on there, one having to do with architectural scalability and performance, and one having to do with merge algorithms.

The argument about performance was basically that Linus claimed that git was super-fast and super-scalable. That was, in fact, incorrect. Since then, the guts of how git behaves have been completely re-done so that the 'fast' operations are fast because they simply make note of what happened, and then you're expected to run a batch process where all the heavy lifting is done periodically. This is a perfectly reasonable trade-off, and one which I wish everything else would do, but far from a vindication of git's architecture - it's an approach which could be taken in any system. Git still has some nasty performance problems, by the way, basically having to do with networking (mercurial does it right).

The other argument was over what sort of merge algorithm is the correct one. Since that old argument I've done quite a bit more work on merge algorithms, and it turns out that there are basically two features, implicit cherry-picking and implicit undo, and you have to pick one. I did a lot of good work on how to implement implicit undo, but it turns out, disappointingly, that the vast majority of projects want implicit cherry-picking, which means three-way merge. There are a lot of details of exactly how the three-way merge is done, which have had significant improvements made to them, hence the code adopted by bazaar.

The subtle edge case has to do with implicit moves between files. Basically Linus advocates a system which will take sloppy patches and try to apply them wherever if the changed file is no longer present. This was hardly a new idea, nor is it terribly hard to implement, but the real question is whether you trust it to not screw up, or suddenly change how it behaves during development. Obviously it can work fine in a few simple cases, but whether it can screw up has a lot to do with how branched a project can get and how frequently such features are used. The way development works in the linux kernel it hasn't become a problem (or at least, if it has, no one's complained about it), but it isn't something I'd advocate using as default behavior for all projects, at least not without it differentiating between proper and heuristic merges, and warning you about exactly what it's going to do before a heuristic merge. Even then there are cases which will get borked no matter what, for example if someone moves and extensively changes a file as a single operation.

There's a list of possible approaches to file moving, all of which have their pluses and minuses, and none of which are a panacea. The one git uses isn't the simplest and most expedient, and it isn't the one I advocate, but it can be made workable for a lot of projects.

There's more version control theory I really ought to post about. It turns out that if you decide to go with three-way merge, it's possible to get much more coherent branch organization than total anarchy.

Tue, Jul. 17th, 2007 06:50 am (UTC)
[info]bramcohen

Oh, another thing I forgot about - my point about whole-tree three-way merge being fundamentally borked has to do with particular edge cases where there's no particularly good ancestor for three-way merge to use, and forcing the merge to be whole tree aggravates that situation quite a lot. This actually has happened in git (although it's rare with typical usage in the linux kernel).
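One well-known shape where three-way merge has no single good ancestor is the criss-cross merge, and it may be the kind of case meant above, though that is an assumption. The sketch below uses a made-up revision graph (a dict from revision to its parents) and a hypothetical helper `merge_base_candidates`; it just shows that such a history yields more than one equally good candidate.

```python
def ancestors(dag, rev):
    """All revisions reachable from `rev` via parent links, including `rev`."""
    seen, stack = set(), [rev]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen

def merge_base_candidates(dag, left, right):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(dag, left) & ancestors(dag, right)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(dag, other) for other in common)
    )

# Two branches, a and b, each merged into the other: a criss-cross.
criss_cross = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "x": ["a", "b"],
    "y": ["a", "b"],
}
```

Merging `a` and `b` has the single candidate `root`, but merging `x` and `y` has two candidates, `a` and `b`, and whichever one a tool picks as the base it throws away information from the other.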

My new theory related to three-way merge has to do with controlling branching in such a way as to disallow the bad merge cases from happening.