Log in

No account? Create an account

Sun, Nov. 7th, 2004, 09:31 pm

Car Emissions

There are two concepts in car pollution which people generally get mixed up. Some exhaust gases are simply stinky and noxious, most notably particulate carbon and carbon monoxide. Those do direct damage to the humans near them and crops which grow nearby and are clearly bad. Pollutants are clearly bad and there isn't much direct economic disincentive for any one person to make their car produce less of them.

The other troublesome kind of exhaust is greenhouse gases, mostly carbon dioxide. The amount of damage caused by these is much less clear, and there's a straightforward economic disincentive to produce them, because they correspond pretty much directly to the amount of gas your car consumes. Carbon dioxide also happens to be produced in mass quantities by respiration.

If you really want to know how clean a car is, look it up on the EPA web site. There are some surprises, for example the honda civic hybrid with a manual transmission has mediocre pollution ratings.

Erasure Codes

People keep asking me about using erasure/rateless/error correcting codes in BitTorrent. It isn't done because, quite simply, it wouldn't help.

One possible benefit erasure codes is that when sending data to a peer there are so many potential pieces that you can send any random one you have and it won't be a duplicate. The problem is that the peer may already have gotten that same piece from another peer, so that benefit is destroyed, and on top of that the overhead of communicating and remembering which peer has what is increased tremendously.

Possible benefit number two is that erasure codes increase the chances that your peers won't already have the pieces which you've downloaded. But simply downloading pieces which fewer of your peers have first handles that problem quite nicely, so a vastly more complicated solution is unwarranted.

Possible benefit number three is that if there's no seed left erasure codes increase the chances that the entire file will be recoverable. In practice, when a file becomes unrecoverable it's because there was only one seed and several downloaders started from scratch, then the seed disappeared after uploading less than the total length of the file. Erasure codes obviously would not help out in that case.

There are other possible benefits and corresponding rebuttals, but they get more complicated. The short of it all is that the possible benefits of erasure codes can be had with much more straightforward and already implemented techniques, and the implementation difficulties of such codes are quite onerous.

While I'm pissing on everyone's parade, I should probably mention another scenario in which everyone wants to use erasure codes and it's a bad idea: off-site backup. If you store everything straightforwardly on each backup site, and each site has two nines (99%) uptime (if it doesn't you shouldn't be using it for backup) then the overall reliability will be six nines (99.9999%). Engineering for more than six nines is nothing but intellectual masturbation, because unforseeable problems completely dominate failure at that point. Therefore one-of-three gets great reliability with unreliable backup sites in exchange for having to store three times the amount of data you're backing up.

With erasure codes, you could make it so that each backup site only had to store half as much stuff, but that two of them would still need to be up to recover data. If you then have four backup sites, there's a savings of 1/3 of the storage versus the much more straightforward approach. This is a pretty small reduction given that the price of mass storage is very small and plummeting rapidly. It also comes at great expense: you have to deal with four backup sites instead of three, and the software is much more complicated. In systems like this, the recovery software not working is a significant part of the chances of the system as a whole failing. Also, any economic benefit of savings on disk space must be weighed against the costs of the software system which runs it. Given the ludicrous prices of backup systems these days, a much simpler albeit slightly less efficient one would probably be a great value.

ECC of course has some great uses, for example data transmission of noisy mediums and storing data on media which can get physically corrupted, and recent developments in it are very exciting, but it's very important to only use sophisticated tools when clearly warranted.

Mon, Nov. 8th, 2004 04:33 pm (UTC)

the civic hybrid manual is a curious case. as you can see by its summary page, the 5HNXV01.33A6 model did have average particulate pollution, but the 5HNXV01.3YCV (presuably a newer model?) now has near-top marks. only the prius automatic is beating it now.

Mon, Nov. 8th, 2004 04:43 pm (UTC)

It appears that the difference is with sales region, not production time. It's very puzzling, because the stinky one is in california, and california is known for its smog checks. Perhaps the california tests are completely borked and encouraging worse overall emissions by hyperfocusing on certain exhaust gases?
All well now - (Anonymous) - Expand
(Deleted comment)

Tue, Dec. 7th, 2004 06:29 am (UTC)
bramcohen: Re: just discovered your lj and added you

If I didn't want people to read what I wrote, I wouldn't post it to the internet.

Thu, Nov. 11th, 2004 11:09 pm (UTC)

It's true that one-of-three gets great reliability with 99%-reliable backup sites, and I agree that it's probably not worth it to build an ECC backup system for this case. But I don't agree that 99%-reliable backup sites are the only ones worth considering.

However, the point of ECC backup is that you can get great reliability with 50%-reliable backup sites, such as DSL-connected machines in Seoul or Comcast subscribers in Ohio. Being able to tolerate 10 failures out of however many backups gives you 99% reliability; 30 gives you your "overkill" six nines. Tolerating 30 failures by duplicating data gives you a 31x cost multiplier; tolerating 30 failures by spreading your data across 300 peers with an error-correcting code that can tolerate 30 failures only increases your cost by 11%, which is noticeably less than 3000%.

In many cases the dominant cost will be data transfer rather than storage. Disks now cost $1 a gigabyte, and they can be erased and used again, but my DSL line costs about $0.15 a gigabyte, and every gigabyte of bandwidth used is used forever.

I agree with your point that simpler and working trumps theoretically superior but unimplemented or broken every time, though. And you would, I suppose, be the person to ask whether HiveCache fits in that elusive third category: superior but working.

Do you suppose Zooko's going to ship MNet any time soon?

Tue, Nov. 23rd, 2004 03:53 am (UTC)

With less than 99% reliable sites the reliability analysis is completely untrustworthy. More to the point, the failures are likely to be very highly correlated. Perhaps 95% is worth considering, but with triple redundancy that still gets you five nines, which isn't worth engineering past.

I don't expect either hivecache or mnet to be usable in a corporate setting any time soon.

Mon, Nov. 29th, 2004 06:43 am (UTC)

I totally agree that we have to justify the use of redundancy on error correction, error detection and data loss. But on BitTorrent, we may overlook a very important factor, the data volume.

Suppose the undetectable error rate for one piece is x% with x is a very small number. If we have 1TB (1GB files downloaded by 1000 people or 4GB files downloaded by 250 people), with average 1MB piece size, transmitted between peers, we have 1,000,000 * x% cases undetected but in error. If x% is 0.0001%, i.e. 99.9999% reliable. We expect to have 1 undetectable error piece. Not bad!

But if the files are still sharing, the error will spread amongst people and no one can tell which file is corrupted. Another 1000 downloaders will add another error piece and it is cumulative.

For some files, it may not be a problem to contain small error in them. Say, in MPEG file, it may affect the movie for a fraction of second. Who notice it and who care? But some files may not allow error at all, .rar file with password may not be opened in such case.

It may be a good idea if BitTorrent can consider special cases like huge number of people share some files or sensitive files to provide optional extra error detection with extra expense in the sharing, not necessary by ECC.

Mon, Nov. 29th, 2004 06:57 am (UTC)

BitTorrent checks data integrity using secure hashes. It isn't susceptible to undetected errors in redistributed data.

Hashes for detecting error and re-requesting are clearly valuable, and are used in TCP as well. The question is whether adding codes which can fix, and not just detect, the errors is worthwile, and the answer is apparently no.
(no subject) - (Anonymous) - Expand

Tue, Dec. 7th, 2004 03:59 am (UTC)
lionkingwf: Erasure Codes

hi Bram,

I partly agree with your comments on Erasure Codes.

Yes, as what you have said an encoding block maybe have been downloaded from other peers. So at this time erasure codes are no better than BT swarming. And because of redundant recovery scheme, erasure codes can't recover files either, when not enough blocks be get.

However, on some other sides I can't get consistent with your views.

1. Most ECCs are better when be implemented simply than be implemented complicated. But not all ECCs are like that. A kind of erasure codes is very suitful for multipoints downloading with large scale peers, called Luby Codes, which using UDP as its transmitting protocol. It has good scalability and high efficiency of recovery, based on XOR operation with data blocks, so it needs only few computing resources and can be simply implemented on usual personal computers.

2. High Storage consumption. Because there are some applications which highly desired of seriously high reliabilty on current unreliable Internet. Compared to bandwidth costs, storage costs are much lower. So the tradeoff between storage and bandwidth exists, people will choose one of the twos.

Most over above all, Luby Codes can get ride of traditional re-request of TCP transmitting. When receivers find error in encoding blocks, these blocks will be discarded without re-requesting. This can improve its efficiency. Because TCP transmission is affected by RTT value, this limited efficiency of TCP protocol.

As discussed above, I think that erasure codes really can help improving swarming efficiency.

Do you agree with me? Any comments of you are welcome.

Tue, Dec. 7th, 2004 06:28 am (UTC)
bramcohen: Re: Erasure Codes

Internet data transfer algorithms require acks for congestion control anyway. Adding some information to those acks about what packets didn't make it adds no overhead. Under some conditions increasing the window size and massaging some other parameters can increase performance, but such tweaking will always perform just as well as and be a lot more utilitarian than transfer algorithms based on ECC.
Re: Erasure Codes - (Anonymous) - Expand

Wed, Dec. 8th, 2004 04:30 pm (UTC)
(Anonymous): Good point

(from Federico Sacerdoti)

Excellent points about using complex software and algorithms for diminishing gains. Its great to see someone being honest about the difficulty and reliability of real software, and factoring into the N-nines calculations.

Until we can prove implementations correct, we had better think long and hard about getting fancy. (I should take this advice myself occasionally).


Wed, Jun. 22nd, 2005 03:33 am (UTC)

I think the network bandwidth maintenance requirements for data correction are playing a major role.

Here is one paper from Berkeley praising the erasure codes: http://classes.eclab.byu.edu/601/wiki/plugin/attachments/CourseCalendar/Reperasure.pdf

And here's another paper from MIT looking at costs of replication vs erasure: http://www.pmg.lcs.mit.edu/~rodrigo/p2p-2ec-final.pdf

It would be interesting to discuss that.

Wed, Jun. 22nd, 2005 05:10 am (UTC)
(Anonymous): why are we worrying about those particualar ideas?

isn't the whole reason bittorrent works better than its predecsorrs, mostly because instead of trying to invent ways to add a few bits here and there to download speed, but rather to add new sources. i think a better topic to pursue would be the ability to remember peers you encountered in the past, even for different downloads. this could speed up bittorrent transfers by reducing 'leeching' even more, particularly because it would tend to favor users who shared more data over time than those who didn't; perhaps ranking remembered users by age of last connection, number of times met, and how 'profitable' it was for the current user to have them as a peer.

Thu, Mar. 23rd, 2006 12:52 pm (UTC)

Just found myself back here due to a random mail trawl. I'm sure you're right that erasure codes aren't worth it for BitTorrent, but there's a slight error in your argument it's probably worth correcting. You say
The problem is that the peer may already have gotten that same piece from another peer, so that benefit is destroyed, and on top of that the overhead of communicating and remembering which peer has what is increased tremendously.
I think you've misunderstood the point of erasure codes. If you generate a random piece, the probability that a peer already has it (or sufficient information to reconstruct it) is practically zero, unless that peer already has the whole file. So you don't have to keep track of your peer's state of knowledge at all; you just send them new pieces until they have as many as the size of the file, in which case (with overwhelming probability) they will be able to reconstruct it. This isn't hard to prove; it's based on the probability of a random set of vectors in GF(2^k)^n being linearly independent.

Thu, Mar. 23rd, 2006 06:51 pm (UTC)

That would be true if only peers which had complete files uploaded pieces, since they're capable of generating random new ones. However, a peer which doesn't have a complete file is stuck re-uploading pieces it already received (or recombining them - I don't know of much academic work indicating what the possibilities are for that) and so you do have to worry about having seen a piece already.