Sat, Apr. 23rd, 2005, 10:16 pm
Version Control Shenanigans

For those of you who have been following the recent BitKeeper shenanigans, I'm now going to give the inside scoop on what's happening in free distributed version control.

There's one main interface to distributed version control, which is essentially how Darcs, Monotone, and Codeville all work. Arch and its kin all have a much klunkier UI which basically makes them a non-contender for projects like the Linux kernel. There are some other potentially competitive projects, such as Vesta, which is extremely mature and powerful but currently doesn't do merging, Bazaar-ng, which is currently mostly vaporware, and svk, which I think has recently switched from an arch-style to a darcs-style interface, but I can't say for sure. There are probably other systems worth at least a mention, but for now I'll just stick to discussing the (to my knowledge) most mature and promising systems.

A new system is Git (more of a version control back end than a version control system), which Linus Torvalds hacked together and initially called a 'stop-gap' measure, but which he now appears to be getting quite excited about and thinks may be a good long-term solution. Git was originally supposed to be simple and fast, but new developments are pushing it toward being not so simple and not so fast, and its network protocol is currently a disaster. The good news is that Git is basically a ripoff of a bunch of the architectural ideas behind Monotone, which are good ideas, so it can't be as big of a disaster as, say, Subversion. However, it's currently extraordinarily similar to a very old version of Monotone, which makes parallel development of Git and Monotone seem like a waste of resources. The initially stated reason for the new system was that Monotone was too slow, but just a little bit of optimization has made Monotone many times faster than it was before, and most of the remaining performance difference is caused by sanity checks, which can simply be turned off. There are perfectly reasonable long-term strategies which involve separate Git development for the time being, and I'll get to those later, but we in the distributed version control world really have no idea what Linus is thinking at this point.

Darcs, Monotone, and Codeville implement one each of the known ways to approach distributed merge - patch commutation, three-way merge, and two-way merge with history, respectively. Most other projects wind up using three-way merge, although it's the least powerful of the three. Darcs and Codeville were both motivated by the invention of their merge algorithms, and neither of the approaches has to my knowledge been invented independently.
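To make the three-way merge idea concrete, here is a minimal sketch in Python. It works on whole values (one file's contents per path) rather than on lines, and the function name and the dict-of-paths representation are assumptions made for illustration, not how Monotone or any other system implements it.

```python
def three_way_merge(base, ours, theirs):
    """Merge two dicts (path -> content) that both descend from `base`.

    Returns (merged, conflicts). A path is a conflict when both sides
    changed it, and changed it differently. A missing path (None) is
    treated as "deleted".
    """
    merged = {}
    conflicts = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(path)
        o = ours.get(path)
        t = theirs.get(path)
        if o == t:
            result = o      # both sides agree, or neither changed it
        elif o == b:
            result = t      # only their side changed it
        elif t == b:
            result = o      # only our side changed it
        else:
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts
```

The only inputs are the two tips and their common ancestor; nothing else from the history is consulted, which is roughly why three-way merge is the least powerful of the three approaches.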

Now, for some comparisons -

Darcs is lacking in hash-based history, which means that past versions can't be reproduced and that an interloper could easily change what the history says. This is a major missing feature. There's currently talk of making Darcs use Git as a back-end, which would give it the hash-based history, but Git's whole-file view of the universe isn't a terribly good match for Darcs's patch-based one. I think a more Codeville-like history would be a much better fit, but doing inference of patches on the client side based on the history would be a workable solution. Darcs also suffers from some extremely bad asymptotic runtimes in the algorithms it uses, which turn up in not terribly uncommon cases. This is currently being worked on, but is an area of actual research rather than simple optimization. Because of these problems, Darcs isn't really ready for prime-time yet, but may be in the not too distant future. Darcs's big advantage right now is that it has very good, extensive support for cherry picking, a feature which is planned for Codeville but not implemented yet, and I'm not sure if it's planned for Monotone.
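For readers who haven't seen one, a hash-based history can be sketched in a few lines of Python. This is only an illustration under my own assumptions (SHA-256, an in-memory dict, the names `revision_id` and `verify`); none of the systems discussed here uses exactly this format.

```python
import hashlib


def revision_id(content, parent_ids):
    """Name a revision by hashing its content together with its parents' ids."""
    h = hashlib.sha256()
    for pid in sorted(parent_ids):
        h.update(pid.encode("ascii"))
        h.update(b"\n")
    h.update(content)
    return h.hexdigest()


def verify(history, rev_id):
    """Check rev_id and all of its ancestors.

    `history` maps revision id -> (content bytes, list of parent ids).
    Returns False if any revision is missing or does not hash to its id.
    """
    seen = set()
    stack = [rev_id]
    while stack:
        rid = stack.pop()
        if rid in seen:
            continue
        seen.add(rid)
        if rid not in history:
            return False
        content, parents = history[rid]
        if revision_id(content, parents) != rid:
            return False
        stack.extend(parents)
    return True
```

Because each id covers the parents' ids, knowing the id of the newest revision is enough to detect a change anywhere in its past.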

Monotone is a fairly mature, mostly traditional three-way-merge-based system with a hash-based history. It has a decent network protocol and rudimentary (but far from complete) support for renames. (Git doesn't have support for renames, a hard-to-change architectural decision which was made when it was supposed to be a quick hack temporary solution.) Monotone's merge algorithm isn't as good as Codeville's, and there's been some talk of making Monotone use Codeville's merge algorithm, but that's an involved topic whose future no one is sure of. Monotone also supports some nice certification functionality, whose importance is unclear and which could be added to other systems. A hash-based history gets a lot of security to begin with, and the certs don't carry over between format changes, so they're causing a fair amount of possibly unnecessary pain for the time being.

Codeville is also fairly mature, and is having the last few rough edges polished up right now. It has a good network protocol (technically, it will in about two weeks), good (but not quite complete) support for renames, a well-done hash-based history, and the best merging of any available system. There's a subtle architectural distinction in the history approach of Monotone and Codeville - Monotone records the secure hashes of all old versions, while Codeville records the changes from the old hashed versions. Monotone's approach is less simple in the end. The problem is that for efficient transfer over the wire you need to pass deltas rather than full copies, so you need to cache the deltas, or generate them on the fly, and they need to be integrity checked as they come down, which means a whole lot of hash checking of intermediate versions. The on-paper advantage of storing complete copies is that if you cache full copies on disk you don't need to run regeneration code when making a new checkout, but that proves to be invalid in practice because the operation's performance is completely dominated by the number of hard drive seeks it has to do, which leads to some weird results like a Codeville checkout being faster than a cp -a. Number of seeks optimization is an interesting subject which is beyond the scope of this entry.
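The "pass deltas, then integrity check them as they come down" step can be sketched as follows. This is a toy in Python using the standard difflib module, with line-based copy/insert operations and SHA-256; the function names and the delta format are assumptions for illustration and are not the wire format of Monotone or Codeville.

```python
import difflib
import hashlib


def content_hash(lines):
    """Hash a version, represented as a list of lines."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def make_delta(old_lines, new_lines):
    """Describe new_lines as copies from old_lines plus inserted lines."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new_lines[j1:j2]))
        # "delete" needs no op: the old lines are simply not copied
    return ops


def apply_delta(old_lines, ops, expected_hash):
    """Rebuild the new version and refuse it if its hash is wrong."""
    new_lines = []
    for op in ops:
        if op[0] == "copy":
            new_lines.extend(old_lines[op[1]:op[2]])
        else:
            new_lines.extend(op[1])
    if content_hash(new_lines) != expected_hash:
        raise ValueError("delta produced a version with the wrong hash")
    return new_lines
```

In a history that names every version by its full-content hash, the receiver has to run a check like this on each intermediate version it reconstructs, which is where the hash-checking cost described above comes from.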

As you've probably gathered, Git's quick hack nature is readily apparent, even ignoring that it hasn't even gotten started on implementing merge yet. A hopeful sign is the development of Cogito, which is a front end to Git with a reasonable interface. If everyone starts using Cogito, then it would be a simple matter to make a Codeville- or Monotone-based back end which was command identical to Git, but also had renames, and a non-sucky network protocol, and decent merging. That depends, of course, on people actually using Cogito as their standard Git interface, which is mostly dependent on what Linus wants, and like I said, we don't currently have any idea what Linus is thinking.

Monotone and Codeville have been growing closer over time. Whether they'll reach a complete unification at some point is an involved topic which hasn't been fully explored.

You can read much of the discussion which has been happening around version control systems at loglibrary. #codeville isn't logged yet though.

The old Linux kernel history is now readily available from SourcePuller, which was written by Tridge. It was the writing of this (extremely simple) script which caused the BitKeeper license to get yanked. Ironically, SourcePuller has helped with damage control from the BitKeeper license yanking by making the full old history available (not just the linearized version available from CVS). Also ironically, Git is currently using rsync for its network protocol, and rsync was written by Tridge. Rsync is actually quite dated, and the wrong tool for that particular job in any case, but that's a whole other subject.

Full disclosure, of course, is that I'm the founder of Codeville and a current contributor (although most of the work these days is done by Ross). While this makes me a bit biased, it's also resulted in me having a much better sense of what's going on.

By coincidence, the Monotone and Codeville web sites are hosted on the same server.

I could go more into some of the personalities involved and implementation details of some of the systems (like what languages they use) but I've already spent way more time on this than intended, so that's all for now.

Tue, May. 3rd, 2005 11:40 pm (UTC)
orib: Issue Tracking, and more?

Two issues I have with version control systems, including the latest-and-greatest distributed systems:

1. Issue tracking. Effective issue tracking must be aware of branch ancestry in some way or another - e.g., if I release xyz-3.4.0, and branch what would later become 3.4.1 and 3.5.0 out of it, I want a bug reported against 3.4.0 to be listed against those newer branches. Furthermore, I would also want it listed against xyz-3.3.x and earlier, possibly back to xyz-0.0.1, if it originated there (something I don't expect whoever put the issue in the issuetracking database to be able to tell).

The problem is obviously not well defined, but I suspect some help from the version control system is required for any solution. I think it makes sense that if you sync other people's code, you should also sync their bug database (and how to properly merge that is yet another all new problem).

2. Multiple file ops. I tend to work bottom up, starting with large-scope, small-linecount files, and breaking them up to smaller files as all the details materialize and the linecount rises. I may split files more often than others, but I suspect splitting & merging of source files is not significantly less common than renames. Yet, no version control system to date considers this important enough to track.

Monotone allows one to manually proclaim a merge by listing all merged entities as parents, and to proclaim a split by listing the original file as a parent of each part. This would allow an annotate/blame feature to be actually useful through a split/merge. Monotone doesn't do this by itself, though.

At the expense of diskspace and a few heuristics, it's possible to provide this functionality - if you can efficiently search parts of the repository (e.g., keep a suffix tree of the entire repository contents), you can say that "X is Y's parent if 90% of Y comes from X" (catches splits, renames, copy-and-modify), and that "X is merged Y, Z, W" if Y,Z,W are ok with X as parent and 90% of X is covered by Y,Z and W.

Ill defined again, but I believe a set of few simple heuristics can properly track changes and ancestry without requiring the user to explicitly announce renames, copies, links, moves etc.
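A rough sketch of the 90% heuristic, in Python. To keep it short it swaps the suffix tree for a plain set of lines, so it only measures whole-line overlap; the function names and the handling of blank lines are assumptions made for illustration.

```python
def coverage(child_lines, parent_lines):
    """Fraction of the child's non-blank lines that also appear in the parent."""
    child = [ln for ln in child_lines if ln.strip()]
    if not child:
        return 0.0
    parent = set(parent_lines)
    return sum(1 for ln in child if ln in parent) / len(child)


def is_parent(x_lines, y_lines, threshold=0.9):
    """X is Y's parent if at least `threshold` of Y comes from X."""
    return coverage(y_lines, x_lines) >= threshold


def is_merge_of(x_lines, candidates, threshold=0.9):
    """X is a merge of the candidates if each candidate's content shows up
    in X and together they cover at least `threshold` of X."""
    if not all(coverage(c, x_lines) >= threshold for c in candidates):
        return False
    combined = [ln for lines in candidates for ln in lines]
    return coverage(x_lines, combined) >= threshold
```

With a 20-line file split into two 10-line halves, `is_parent` reports the original as the parent of each half, and `is_merge_of` reports the original as a merge of the two halves but not of one half alone.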

Any thoughts on the subject?

Wed, May. 4th, 2005 05:15 pm (UTC)
ciphergoth: Re: Issue Tracking, and more?

As regards (1), you wouldn't expect the VC system to store this information, so what you're asking for is for the VC system to export enough information for the issue tracker to handle the problem. For VC, then, the problem is just "make metadata available" and everything else is punted over the fence.

Sat, May. 7th, 2005 06:46 am (UTC)
bramcohen: Re: Issue Tracking, and more?

Integrating issue tracking and the version control system is a good idea, although that should be done at a higher layer. The issue tracking system's database can be kept as a file or files in the codebase, and a tool could parse and modify that file as appropriate.

Sat, May. 7th, 2005 08:58 am (UTC)
ciphergoth: Re: Issue Tracking, and more?

I don't get this - surely you usually find a bug in revision X only after revision X has been committed? So you can only annotate the source with the record of the bug after the fact, which means the annotation can't be part of the source?

Sun, May. 8th, 2005 05:01 am (UTC)
bramcohen: Re: Issue Tracking, and more?

Hmm, true, you can't retroactively modify an old snapshot to indicate that there's a bug in it. Clearly something more sophisticated is called for.

Sun, May. 8th, 2005 12:22 am (UTC)
orib: Re: Issue Tracking, and more?

I was trying to think of a way to do that (along the line you mentioned), but couldn't get anything satisfying. The closest I got was keeping a parallel tree, more or less for the reasons ciphergoth noted. Care to share your insights?

Also, your thoughts re: "merging" of the bug state/description?

Sun, May. 8th, 2005 05:03 am (UTC)
bramcohen: Re: Issue Tracking, and more?

The problems of including bug stuff within the codebase hadn't occurred to me; it was just an idle thought of mine from a number of years back.

I'm not sure what to do about merging bug state - if one branch has it marked 'fixed' and another has it marked 'not a bug', what should the merged version of the two say? Perhaps my conception of how a bug gets versioned is completely off. I haven't thought about this problem much, and it's apparently a more interesting problem than I thought.

Mon, May. 9th, 2005 01:18 pm (UTC)
ciphergoth: Re: Issue Tracking, and more?

This folds into my plan for a version control system in which the metadata is versioned separately. This means you can correct typos in checkin comments, or introduce intermediate versions after the fact. You could also mark a version as containing a bug, or mark a set of changes as fixing that bug.

I'm pretty sure that a perfect bug presence inference system is impossible - it would have to have a third state in between "marked as containing the bug" and "marked as not containing the bug" of "possibly marked as containing the bug". You could then go back and add annotations to help it make its mind up.
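A minimal sketch of the three-state idea, in Python. The ancestry graph is a plain dict from revision to child revisions, and the inference rule (a bug persists in descendants until a revision marked as a fix) is an assumption chosen for illustration; all names here are hypothetical.

```python
from collections import deque


def descendants(children, start):
    """All revisions reachable from `start` via child links, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        rev = queue.popleft()
        for child in children.get(rev, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def bug_status(children, revisions, buggy, fixes):
    """Classify each revision as "bug", "no bug" or "possibly".

    `buggy` holds revisions annotated as containing the bug, `fixes` holds
    revisions annotated as fixing it. Explicit "buggy" marks always win.
    """
    fixed = set()
    for rev in fixes:
        fixed |= descendants(children, rev)
    affected = set()
    for rev in buggy:
        affected |= descendants(children, rev)
    status = {}
    for rev in revisions:
        if rev in buggy:
            status[rev] = "bug"
        elif rev in fixed:
            status[rev] = "no bug"
        elif rev in affected:
            status[rev] = "bug"
        else:
            status[rev] = "possibly"
    return status
```

Revisions older than the first one marked as buggy come out as "possibly", since nothing says where the bug originated; adding an annotation to an older revision moves them into one of the other two states.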

Mon, May. 9th, 2005 02:45 pm (UTC)
bramcohen: Re: Issue Tracking, and more?

Sounds reasonable as a general approach, but I think that's a layer above the main traditional VCS, so all the current work on building a modern one of those will prove applicable.

Mon, May. 9th, 2005 04:54 pm (UTC)
ciphergoth: Re: Issue Tracking, and more?

Except that maintaining, serving and using metadata has always been considered to be a part of the work of building a traditional VCS. If you did things this way, the layer below is no more than a system for mapping hashes of files to files in a distributed way.

Mon, May. 9th, 2005 07:43 pm (UTC)
bramcohen: Re: Issue Tracking, and more?

No, the underlying system still has an immutable history of file versions. The only thing currently in the underlying system which might get moved up is commit comments. Codeville in particular is extremely narrow about what it does, and basically doesn't keep track of metainformation at all.

I'm unconvinced that metainformation shouldn't be versioned and kept in the immutable history and simply refer to older versions, by the way. The scope of such information and problems involved are something I'm not really aware of. Plain old 'retroactively modify old commit comments' is something we've thought about for Codeville, decided has some difficulties, and just punted on.

Tue, May. 10th, 2005 07:45 am (UTC)
ciphergoth: Re: Issue Tracking, and more?

"No, the underlying system still has an immutable history of file versions."


"Codeville in particular is extremely narrow about what it does, and basically doesn't keep track of metainformation at all."

Surely it needs to know the ancestry of each version in order to infer the weave?

"I'm unconvinced that metainformation shouldn't be versioned and kept in the immutable history and simply refer to older versions, by the way."

In the presence of branching and so forth, I think that would be extremely weird!

Tue, May. 10th, 2005 08:21 am (UTC)
bramcohen: Re: Issue Tracking, and more?

Perhaps it would be easiest to just point you to a doc of the next version of Codeville's history format, once we write one, so you can see its scope clearly.

The problem with metainformation like retroactively changing commit messages is that two people on different branches can do it, then when they merge together you have to merge the retroactive changes. We've just thrown up our hands and not tackled that problem.

Tue, May. 10th, 2005 08:29 am (UTC)
ciphergoth: Re: Issue Tracking, and more?

That would be very cool indeed - thanks!

I had envisaged that each author would own the metadata for their own versions, so it could never get branched.

Mon, May. 9th, 2005 01:43 pm (UTC)
ciphergoth: Re: Issue Tracking, and more?

Another thing you can do with versioned metadata is easily move the marker that says "this is my best current version, if you want my latest changes merge against this". You can move it backwards as well as forwards.

Or you can correct errors in commit comments that state which bug a particular change is intended to fix.

Obviously I'm assuming a flexible and extensible metadata format here.

I don't think you need "meta-metadata"; you can use a simple date-and-chaining strategy to manage changes in the metadata.
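One possible reading of a date-and-chaining strategy, sketched in Python. Each change to a piece of metadata is a new record carrying a date and the hash of the record it supersedes, and the current value is simply the newest record. The record fields, the use of JSON and SHA-256, and the function names are all assumptions for illustration.

```python
import hashlib
import json


def record_id(record):
    """Stable id for a metadata record: hash of its canonical JSON form."""
    data = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def latest(log, revision, field):
    """Newest record for (revision, field), or None if there is none.

    Dates are ISO 8601 strings in UTC, so string comparison orders them.
    """
    matches = [r for r in log if r["revision"] == revision and r["field"] == field]
    return max(matches, key=lambda r: r["date"]) if matches else None


def amend(log, revision, field, value, date):
    """Append a record that supersedes the current one for (revision, field)."""
    prev = latest(log, revision, field)
    record = {
        "revision": revision,
        "field": field,
        "value": value,
        "date": date,
        "previous": record_id(prev) if prev else None,
    }
    log.append(record)
    return record
```

Correcting a commit comment is then one more `amend` call against the same revision; the old comment stays in the log and is linked from the new record.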