Practical Cryptography Corrected

The book 'Practical Cryptography' is perfectly good for giving an overview of basic concepts in cryptography, but its immediate practical advice to implementers is neither especially accurate nor to the point. Here is advice which is much more accurate and to the point.

  • For a block cipher you should use AES-128. If you don't understand your protocol well enough to know whether there are birthday attacks on your keys, you have bigger issues. (Shame on Schneier for still trying to revisit the AES design competition by yammering on about Twofish.)

  • For an encryption mode, you should always use CTR, and always use a nonce of zero, and never reuse keys.

  • For a hash function, you should use SHA-256 until the winner of the SHA-3 design competition is announced, and then you should use SHA-3.

  • You should always do encryption as a layer outside of authentication.

  • For entropy, you should always do whatever the Python os.urandom() call does on the local platform.

  • For a data format, you should use JSON. (Not XML!)

  • For an RSA exponent, you should always use 2. Technically that's Rabin-Williams, and requires slightly different implementation, but that actually works in its favor. Rabin-Williams has a reduction to factoring, RSA does not.

  • You should never use the same key for both encryption and authentication. If you need both encryption and authentication, you should use two keys: one for encryption and one for authentication.

  • If you're going to be using RSA, you should learn about encoding for it! This is by far the most historically error-prone part of crypto protocols, and Practical Cryptography bizarrely doesn't even mention it as an issue.

  • You should not parameterize your protocols! That just creates compatibility problems. If necessary you can parameterize it by having two values, one good for the next twenty years, and one good until the end of time, but key sizes have gotten big enough that skipping that first one should be the default.
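
Several of these points fit together, so here's a toy sketch of the overall shape: two keys derived from one master key, authentication applied first with encryption as the outer layer, and a CTR-style keystream with a zero nonce. All names are mine, and since Python's standard library has no AES, a SHA-256 counter construction stands in for AES-128-CTR purely for illustration; real code should use an actual AES implementation.

```python
import hashlib
import hmac

def derive_keys(master):
    # Two independent keys: one for encryption, one for authentication.
    enc_key = hashlib.sha256(master + b"|encrypt").digest()
    mac_key = hashlib.sha256(master + b"|authenticate").digest()
    return enc_key, mac_key

def ctr_keystream(key, length, nonce=0):
    # CTR-mode structure: keystream block i = PRF(key, nonce || i).
    # SHA-256 stands in for AES-128 here, purely for illustration.
    out = bytearray()
    block = 0
    while len(out) < length:
        counter = nonce.to_bytes(8, "big") + block.to_bytes(8, "big")
        out += hashlib.sha256(key + counter).digest()
        block += 1
    return bytes(out[:length])

def seal(master, plaintext):
    enc_key, mac_key = derive_keys(master)
    tag = hmac.new(mac_key, plaintext, hashlib.sha256).digest()
    body = plaintext + tag  # authenticate first...
    stream = ctr_keystream(enc_key, len(body))
    # ...then encrypt as the outer layer.
    return bytes(x ^ y for x, y in zip(body, stream))

def unseal(master, ciphertext):
    enc_key, mac_key = derive_keys(master)
    stream = ctr_keystream(enc_key, len(ciphertext))
    body = bytes(x ^ y for x, y in zip(ciphertext, stream))
    plaintext, tag = body[:-32], body[-32:]
    expected = hmac.new(mac_key, plaintext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication check failed")
    return plaintext
```

The zero nonce is safe only because each master key is used once, per the advice above; a real master key would come from something like os.urandom.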


Maybe someday Schneier will write a book which I can recommend to people who are getting started in cryptography, but I'm not holding my breath.




Git Can't Be Made Consistent

This post complains about Git lacking eventual consistency. I have a little secret for you: Git can't be made to have eventual consistency. Everybody seems to think the problem is a technical one, of complexity vs. simplicity of implementation. They're wrong. The problem is semantics. Git follows the semantics which you want 99% of the time, at the cost of having some edge cases which it's inherently just plain broken on.

When you make a change in Git (and Mercurial) you're essentially making the following statement:

This is the way things are now. Forget whatever happened in the past, this is what matters.


Which is subtly and importantly different from what a lot of people assume it should be:

Add this patch to the corpus of all changes which have ever been made, and are what defines the most recent version.


The example linked above has a lot of extraneous confusing stuff in it. Here's an example which cuts through all the crap:

  A
 / \
B   B
|
A


In this example, one person changed a file's contents from A to B, then back to A, while someone else changed A to B and left it that way. The question is: What to do when the two heads are merged together? The answer depends deeply on the assumed semantics of what the person meant when they reverted back to A. Either they meant 'oops I shouldn't have committed this code to this branch' or they meant 'this was a bad change, delete it forever'. In practice people mean the former the vast majority of the time, and its later effects are much more intuitive and predictable. In fact it's generally a good idea to make the separate branch with the change to B at the same time as the reversion to A is done, so further development can be done on that branch before being merged back in later. So the preferred answer is that it should clean merge to B, the way 3-way merge does it.

Unfortunately, this decision comes at significant cost. The biggest problem is that it inherently gives up on implicit cherry-picking. I came up with some magic merge code which allowed you to cut and paste small sections of code between branches, and the underlying version control system would simply figure out what you were up to and make it all work, but nobody seemed much interested in that functionality, and it unambiguously forced the merge result in this case to be A.

A smaller problem, but one which seems to perturb people more, is that there are some massively busted edge cases. The worst one is this:

  A
 / \
B   B
|   |
A   A


Obviously in this case both sides should clean merge to A, but what if people merge like this?

  A
 / \
B   B
|\ /|
A X A
|/ \|


Because of the cases we just went over, they should clean merge to B. What if they are then merged with each other? Since both sides are the same, there's only one thing they can merge to: B

  A
 / \
B   B
|\ /|
A X A
|/ \|
B   B
 \ /
  B


Hey, where'd the A go? Everybody reverted their changes from B back to A, and then via the dark magic of merging the B came back out of the ether, and no amount of further merging will get rid of it again!

The solution to this problem in practice is Don't Do That. Having multiple branches which are constantly pulling in each other's changes at a slight lag is bad development practice anyway, so people treat their version control system nicely and cross their fingers that the semantic tradeoff they made doesn't ever cause problems.

Python wish list

Now that the moratorium on Python language features is over, I'll put in my thoughts on what new stuff the language could use. I don't have much to suggest, and what I do have to suggest is fairly minor. This is because I'm happy with the language.

new on default parameters

One of the gotchas in Python is that default parameters are reused, so if you say:
def spam(eggs = []):

then eggs will get set to the same list every time, and modifications will get carried over between calls. This can be hacked around like this:
def spam(eggs = None):
    if eggs is None:
        eggs = []

This works, but is ugly, and prevents passing in None as a value for eggs. It would be better to be able to simply say:
def spam(eggs = new []):

which should do exactly what you expect.

^ on bytes

A strange oversight in Python3 is that bitwise operators don't work on byte arrays. The ^, & and | operators should work on bytes of equal length, doing exactly what they obviously should. Trying to apply them to bytes of unequal length should probably result in an error. It's easy enough to write functions to do these things, but they're slow, and there's only one reasonable semantics for what those operators should do on byte arrays anyway.
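
Until that happens, a reasonably fast pure-Python version can lean on big-int conversion instead of a per-byte loop (the function name here is my own):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Bitwise XOR of two equal-length byte strings, done as a single
    # big-integer operation rather than a slow per-byte Python loop.
    if len(a) != len(b):
        raise ValueError("arguments must be the same length")
    n = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return n.to_bytes(len(a), "big")
```

The same trick works for & and |; only the operator changes.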

raw binary conversion

Maybe this has been added to the standard library and I just haven't heard about it, but a longstanding annoying missing piece of functionality is simple conversion between ints and big or little endian representations of them as bytes. Again, this is easy enough to implement, but is slow when done in Python and is hardly an obscure piece of functionality.
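
For what it's worth, this did eventually land in the standard library: Python 3.2 added int.to_bytes and int.from_bytes, which cover exactly this use case:

```python
# Conversions between ints and big- or little-endian byte strings,
# using int.to_bytes / int.from_bytes (available since Python 3.2).
n = 0x01020304

big = n.to_bytes(4, "big")        # most significant byte first
little = n.to_bytes(4, "little")  # least significant byte first

# Round-trips back to the original int.
assert int.from_bytes(big, "big") == n
assert int.from_bytes(little, "little") == n
```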

dictionary scrambling

This might be an obscure piece of functionality, but I'd like the ability to change the hashing function which dictionaries use, because I write tests which depend on my code behaving the same way every time it's run, and I'd like to be able to test that it doesn't have any dependencies on the order of iterating over dictionary keys or values.
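
Python later gained something adjacent: hash randomization (on by default since 3.3, controlled by the PYTHONHASHSEED environment variable), which perturbed dict iteration order between runs until dicts became insertion-ordered in 3.7. A cheap way to get a similar check without touching the interpreter's hashing at all is to feed the code under test deliberately scrambled insertion orders; this helper (name and approach are mine, a sketch rather than a true hash replacement) does that:

```python
import random

def scrambled(d, seed):
    """Return a copy of d with its insertion order (and hence, in modern
    Python, its iteration order) shuffled deterministically by seed."""
    items = list(d.items())
    random.Random(seed).shuffle(items)
    return dict(items)
```

A test can then run its assertions against scrambled(config, s) for several seeds to verify nothing depends on key order.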

Censorship resistance attacks and counterattacks

Related to the recent Haystack hubbub, here's a basic overview of censorship resistance tools, of which Haystack was an example (unfortunately a fairly broken one).

For the purposes of these notes, by 'censorship resistance tools', I'll be referring to ones for browsing the web from inside of countrywide firewalls which are trying to limit access, such as Freegate, Ultrasurf, and the like. Obviously there are other forms of censorship and resistance to it, but that's what's being discussed for now.

The usage pattern for censorship resistance tools goes something like this:
  1. system sends information about proxies to users

  2. users use proxies to browse the web freely

  3. firewall operator finds out IPs of proxies and blocks them by IP

  4. go back to step 1

It's an ongoing cat and mouse game involving cycling through a lot of IPs and a lot of careful secrecy.

An attacker might also, instead of outright blocking an IP, artificially create a very high packet loss rate going to it, which might make users conclude that the anti-censorship system doesn't work very well and give up on it. That could be countered by trying to guess when there's an artificially high packet loss rate, but that's potentially an insidious game - the attacker might, for example, determine where the machines developers use for testing are, and not artificially drop packets to those.

There's considerable concern about the threat model of the censor finding out which users are using the proxies and doing bad things to them. I'll just cut to the chase on that issue - the resistance to attacks of that form is inherently weak. The censor can simply record the destinations of all outgoing connections, and retroactively correlate them to discovered proxies, unveiling the IP of a user. This is a vicious attack which can't be completely eliminated. Possession of the tool might also be incriminating.

High level methods of avoiding detection include:

  • Have lots of cover traffic - that is, lots of users, so attacking them all is impractical. This is probably the ultimate solution, because a tool which doesn't have enough users to provide cover traffic isn't successful, and a successful tool implicitly provides lots of cover traffic.

  • Have users use shared/ephemeral IPs. This is a low tech approach having little to do with the protocol.

  • Use no software, that is, plain http/https proxies. This leaves the user with no lingering evidence, but can expose what the user is doing to snooping.

  • Use ephemeral or easy to dispose of software. This is a good idea, but the techniques for doing it are tricky or rely on physical security.

  • Run proxies on web sites running other services which are also used by users within the target area. This is a great approach, but requires cooperation of a web site which has the willingness to be (or confidence it won't be) blocked.

  • Use actual Skype connections. This is an interesting approach which has the benefit of lots of cover traffic, but suffers from limitations on the bandwidth Skype intermediaries will provide, and could be attacked by an attacker running lots of high quality Skype nodes and noticing the very suspicious traffic.

  • Dial down the level of paranoia. In the end a certain amount of this may be necessary.


Censors have multiple ways of finding IP addresses which are used by the anti-censorship system:

  • Use the same methods as the software. This is a very insidious approach, putting the anti-censorship system in a position of trying to simultaneously publish new IPs and keep their distribution limited.

  • Correlation attacks on existing known IPs. This is also a very insidious attack - the attacker simply takes IPs which are known to use the anti-censorship tool, and looks for otherwise unpopular IPs which a lot of those are connecting to.

  • Probing - an attacker can connect to suspected proxies and try to get them to give themselves away by doing a handshake. Depending on the type of proxy connection used, this can be very effective, sometimes in combination with reverse DNS.

  • Trick proxy users into hitting a web site and watch what IPs the connections come from, revealing the IPs of the proxies directly.

  • Deep packet inspection and traffic pattern analysis, including packet sizes, connection number and duration, etc. These can be extremely effective, but can be extremely expensive for an anti-anti-censor to set up. Connection number and duration are probably the most telling pieces of information, and the cheapest to implement, as well as the easiest for the anti-censor to manipulate.


There are several ways for an anti-censor to make it hard to find their IPs:

  • Use lots of IPs. If each user can be given their own dedicated IP then the system is extremely hard to attack. Problem is, this approach requires procurement of lots of IPs, which isn't easy.

  • Limit how many users the info is given to. This is a good idea, but difficult to do.

  • Encrypt info with not widely circulated keys. This moves the problem to key distribution and management, which is a good idea.

  • Distribute fake IPs including stuff the censor would regret blocking. I think this is kind of fun.

  • Have clients only connect to one IP. This is a very good idea! Should be followed as closely as possible.

  • Make traffic go through more than one hop, masking the IPs of proxies to connections on the outgoing side. While clearly a good idea, this doubles the bandwidth used, which kind of sucks.

  • Rely on deep packet inspection being hard. Less unreasonable than you might imagine - deep packet inspection systems are very expensive and take a while to upgrade, and intelligence on what the deep packet inspection can do is sometimes available.

  • Steganographically encode connections to proxies - this obviously must be done, although it isn't obvious what the best approach is.


There are several things proxy connections could be made to look like -

  • HTTP - while there's plenty of cover traffic for HTTP, deep packet inspection and probing can probably be very effective in recognizing patterns in it, making it not very appealing for stego connections

  • SSL/TLS - there's a decent amount of cover traffic for TLS connections in the form of HTTPS, and using the HTTPS port is probably a good approach, especially since the traffic patterns are going to match http anyway, since that's what it is. There's some concern that man in the middle attacks might be launched, although those are difficult, and an attacker might get suspicious if reverse DNS doesn't return believable information. Still, this may be the best option, and is certainly the simplest to implement.

  • BitTorrent - BitTorrent has lots of cover traffic, and the obfuscated version of the protocol looks fairly generic, although its traffic patterns are very distinctive and wouldn't be closely matched by anti-censorship web browsing.

  • uTP - uTP is a UDP-based TCP-alike originally designed for BitTorrent. It has the advantage that some deep packet inspection systems just plain don't support UDP, and it's easy to use as a swap-in replacement for TCP. It has some of the same cover traffic problems as regular BitTorrent.

  • SSH - tunneling over SSH is not uncommon, so using SSH connections is no more suspicious than having long-lived high-throughput SSH connections is to begin with - but that's already a high level of suspiciousness, so this probably isn't a great approach.

  • Skype - Skype traffic has good cover traffic, but is a very poor match in terms of usage patterns.

  • noise - a TCP connection which has just plain garbage going over it is a surprisingly reasonable approach. Lots of weird miscellaneous things on the internet are hard to classify, and obfuscated BitTorrent provides a decent amount of cover.


There are several methods a censorship resistance system can use to get IP addresses out -

  • offline - this is the most secure way, but it's very slow and expensive

  • spam cannon - a spam blast can be sent out containing addresses of proxies. This works but is moderately slow and moderately expensive. It's also potentially very easy to intercept.

  • to existing users - client software can be sent IPs of fallback proxies when it makes a proxy connection. This works and is fast, but has the problem that an attacker can run client software and use it to find proxies as well.

  • via web stego - this technique hasn't been used yet, but IPs could be encoded steganographically in real web traffic. Given the tremendous popularity of censorship resistance tools in the west, it might be possible to enlist the help of lots of web sites, and make it essentially impossible to filter them all out. I'm working on technology for this.

Rebasing

Rebasing is a controversial feature, with some people declaring it the greatest thing since sliced bread and others declaring it the spawn of the devil (okay I exaggerate, but you get the idea). I will now set the record straight by explaining The Truth.

Rebase is a good feature.

But it's hard to implement properly on a version control system using the standard Hg/Git architecture.

There are a number of problems with the standard architecture, but the one important for rebasing is that it treats data like a roach motel - changes can get in, but they can't get out. Rebasing is fundamentally about removing some changes in a controlled way, and the closest you can get is to rewrite some history which never made it off the local machine and pretend it never happened.

I'll now describe an architecture which supports rebasing very well. Whether similar techniques can be used to add rebasing to the standard architecture will have to be left for another day.

First, we need a concept of a branch. A branch is a snapshot of the codebase which changes over time. For rebasing to work well, there needs to be a very strong notion of branches, and the simplest way to get that is to have a single centralized repository with a list of branches whose values change over time and whose old values are remembered. To have relationships between branches, we dispense completely with the complex notion of history which hash-based architectures have, and introduce a concept of 'parent'. Each version of each branch has a parent specified, although it could be null. A version of a branch represents two things simultaneously:

(1) A snapshot of the codebase

(2) a patch off of its parent

Individual branches are modified with CVS/svn style update/commit.

Rebasing is now straightforward. You take a diff from the parent to the current snapshot, and apply that to the new parent. This can be done for any potential parent, including the latest version of the branch which the current parent is from, the latest version of another branch, and even older versions of other branches or the current parent. Any reparenting will propagate to all child branches, and changing the parent back will re-propagate nicely as well. This approach allows you to nicely specify what goes where, without having that roach motel feel of code committing.
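
As a sketch of that reparenting step: replaying the old-parent-to-snapshot diff onto a new parent is just a 3-way merge with the old parent as the base. Here's a toy line-based version in Python using difflib from the standard library; all names are mine, and it punts on conflicts by raising instead of emitting conflict markers.

```python
import difflib

def reparent(old_parent, snapshot, new_parent):
    """Replay the old_parent -> snapshot patch on top of new_parent.
    Equivalent to a 3-way merge with old_parent as the base.
    All arguments and the result are lists of lines."""
    def edits(other):
        # Non-equal opcode regions: (start, end) in old_parent, replacement.
        sm = difflib.SequenceMatcher(None, old_parent, other)
        return [(i1, i2, other[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
    ea, eb = edits(snapshot), edits(new_parent)
    out, i = [], 0
    while ea or eb:
        # Take whichever side's next edit starts first in the base.
        if not eb or (ea and ea[0][0] <= eb[0][0]):
            i1, i2, rep = ea.pop(0)
            other = eb
        else:
            i1, i2, rep = eb.pop(0)
            other = ea
        if other and other[0][0] < i2:
            raise ValueError("conflict: both sides touch the same lines")
        out.extend(old_parent[i:i1])  # unchanged base lines up to the edit
        out.extend(rep)               # the edited replacement
        i = i2
    out.extend(old_parent[i:])
    return out
```

Changing the parent back simply means running the same operation against the old parent again, which is what makes the propagation described above well-behaved.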

There would be a number of practical benefits to this architecture beyond allowing nice rebasing, although writing a new version control system from scratch today seems like an extreme approach. In the future I'll comment about the possibility of using parents in Hg/Git, after I've given some thought to it.

Patience Diff Advantages

There's no coherent explanation of what the advantages of Patience Diff are, so I'll explain now. First, a quick overview of how Patience Diff works -
  1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
  2. Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.
  3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
  4. Do steps 1-2 on each section between matched lines
I've previously described it with the ordering a bit different and a recursive step at the end, but this ordering always gives the same result and performs much better; the performance hit of doing the recursion isn't worth it, because it rarely if ever finds any more matches, and even when it does it isn't clear whether the extra matches produce a functionally superior diff. A much more detailed and more graphical explanation is here. A side by side of how that particular diff gets mangled by standard diff algorithms is here. This is the example which motivated patience diff in the first place.

If you make extensive changes to a file and remove a function from the end of it and add a function to the beginning, there will be a tendency for an LCS-based diff algorithm to match up all of the curly brackets instead of the functions, so if it's matching F1 F2 F3 to F4 F1 F2 instead of viewing F4 as new and F3 as deleted, it will match the curly brackets of F1 to F4, F2 to F1, and F3 to F2. The result is every bit as gross and useless as it sounds, and can easily force a merge conflict in cases which could have been resolved completely cleanly. This has a particularly strong tendency to happen if you put curly brackets on lines by themselves, or are religious about putting curly brackets around single-line blocks even though it isn't technically necessary, or even if you just put lots of blank lines between your functions.
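
The four steps above are compact enough to sketch directly. This is my own minimal Python reading of the algorithm, a sketch rather than a production implementation; it returns the matched (a_index, b_index) line pairs, and the diff falls out as everything left unmatched.

```python
import bisect
from collections import Counter

def _common_ends(a, b, alo, ahi, blo, bhi):
    """Steps 1-2: match identical lines at the start, then at the end."""
    head, tail = [], []
    while alo < ahi and blo < bhi and a[alo] == b[blo]:
        head.append((alo, blo))
        alo += 1
        blo += 1
    while alo < ahi and blo < bhi and a[ahi - 1] == b[bhi - 1]:
        ahi -= 1
        bhi -= 1
        tail.append((ahi, bhi))
    return head, tail[::-1], alo, ahi, blo, bhi

def _unique_lcs(a, b, alo, ahi, blo, bhi):
    """Step 3: longest common subsequence, via patience sorting, restricted
    to lines occurring exactly once on each side of the region."""
    ca, cb = Counter(a[alo:ahi]), Counter(b[blo:bhi])
    posb = {b[i]: i for i in range(blo, bhi) if cb[b[i]] == 1}
    pairs = [(i, posb[a[i]]) for i in range(alo, ahi)
             if ca[a[i]] == 1 and a[i] in posb]
    piles, tops, seen = [], [], []  # patience piles over b-positions
    for ai, bi in pairs:
        j = bisect.bisect_left(piles, bi)
        seen.append(((ai, bi), tops[j - 1] if j else -1))
        if j == len(piles):
            piles.append(bi)
            tops.append(len(seen) - 1)
        else:
            piles[j] = bi
            tops[j] = len(seen) - 1
    out, k = [], (tops[-1] if tops else -1)
    while k != -1:  # backtrack from the top of the last pile
        pair, k = seen[k]
        out.append(pair)
    return out[::-1]

def patience_matches(a, b):
    """Return matched (a_index, b_index) line pairs between two line lists."""
    head, tail, alo, ahi, blo, bhi = _common_ends(a, b, 0, len(a), 0, len(b))
    matches = list(head)
    anchors = _unique_lcs(a, b, alo, ahi, blo, bhi)
    la, lb = alo, blo
    # Step 4: steps 1-2 again on each section between matched unique lines.
    for ai, bi in anchors + [(ahi, bhi)]:
        h, t, *_ = _common_ends(a, b, la, ai, lb, bi)
        matches.extend(h)
        matches.extend(t)
        if (ai, bi) != (ahi, bhi):
            matches.append((ai, bi))
        la, lb = ai + 1, bi + 1
    matches.extend(tail)
    return matches
```

On the function-moved example above, the prefix/suffix passes alone match the functions as whole units, which is exactly why the curly-bracket mismatching never gets a chance to happen.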

Another advantage of patience diff is that it frequently doesn't match lines which just plain shouldn't match. For example, if you've completely rewritten a section of code it shouldn't match up the blank lines in each version, as this example shows. Finally, there's this example:


 void func1() {
     x += 1
 }

+void functhreehalves() {
+    x += 1.5
+}
+
 void func2() {
     x += 2
 }
Which is straightforward and obvious, but frequently diff algorithms will interpret it like this:


 void func1() {
     x += 1
+}
+
+void functhreehalves() {
+    x += 1.5
 }
 
 void func2() {
     x += 2
 }
While technically that works, it's obviously inferior and is a bit of a pet peeve of mine. In principle one could tweak an LCS-based diff to always do the right thing here, but the algorithmic complexity of LCS makes implementations of it essentially unmaintainable. That's another advantage of patience diff - it's so simple and understandable that it can be modified and extended reasonably, and why it performs correctly in all these examples can be easily analyzed.

Enterprise Chatroulette

Chatroulette is an interesting concept. On the plus side, it reliably provides people to talk to (especially if you're a hot babe), and leads to some actual conversations. On the downside, it has too many (literal) wankers, and feels lonely and disconnected while at the same time feeling like it's always just about to give you some real social interaction in the same way IRC does. I don't feel myself getting sucked in, but then I stopped feeling the allure of IRC a few years ago.

What the world needs is a service which is to Chatroulette what Yammer is to Twitter - the same service, but only for people who have accounts on the same domain. That would allow it to be un-anonymized, since it could display the user's login, and solve the wankers problem, because HR and a lack of anonymity take care of that. For companies above some size, it would be a great way for people to get the sort of random interaction with coworkers that they normally do in the office while working remotely. It would also probably have an easy upsell to an expensive premium service for any company which started using it a lot, even if they didn't mean to.

A new card game

I came up with a new card game the other day. It's been play-tested a bunch, and is a lot of fun.

The Rules:

There are two players. A standard poker deck is shuffled, and sixteen cards are dealt out face up, eight to each player. Both players get to stare at the cards for a while.

Each player then gathers up their eight cards and sets aside four of the eight cards for the other player. They then exchange the cards they've set aside for each other. Each player then forms the best poker hand they can with any five of the eight cards in their hand, and the one with the better hand wins.

That's the entirety of the rules. What makes play interesting is that if you can guess what cards your opponent will keep there's almost always a set of cards you can set aside which will make you win.

Freemium results in terrible games

Game designers in general think of themselves as performing a valuable service for humanity. Aside from simply entertaining, they have the goal of making their users smarter, more disciplined, and all around better than they were before playing. In large part they can succeed in doing this. Unfortunately the freemium business model doesn't encourage game mechanics which enhance the user's cognitive skills. In fact it directly fights against them.

Freemium at first blush doesn't seem like a bad idea. Players get to try out a game for free, and generally can keep playing for free, but if they're really into it then they can pay for some game extras. This results in cheap games, which make money by servicing their most die-hard fans, with continuous development and, above all, more games. On the surface this sounds like a huge improvement over the flat fee model of trying to sucker as many people as possible into buying a game they only play a few times. The problem comes in with the psychology of players, and it's rooted so deeply in the human psyche that I don't have any good solutions to it.

There are essentially two kinds of game players - those who view achievement at a game as validation of their skills, commitment, and endurance, and those who see game achievement as an end in and of itself. The first kind will not on principle ever pay any money to get a leg up in a game, because that would permanently cheapen whatever success they might later have. It would be, to use a vulgar term, cheating. Players who are just interested in progressing, on the other hand, are perfectly happy to do that, and hence are much easier to monetize - just offer them in-game gold for a few real bucks, and on a day when they're feeling frustrated, they'll happily fork it over.

The people who design games, and most of the hard-core gamers, are from the skill camp, and instinctively fight the pay-to-cheat model, sometimes to a ludicrous degree. When Ultima Online first experienced a large market for in-game items, a phenomenon which had never been seen before, the designers banned the practice of selling items and went to great lengths to stamp out the aftermarket, rather than simply selling in-game items themselves, which would have resulted in them making a lot more money and their players being a lot happier. I personally was very excited when the game Bloons Tower Defense 4 came out, because I'd made it through all the levels of Bloons Tower Defense 3 and greatly enjoyed them. I happily played through most of the levels of 4 on the hardest setting, until I got to the fifth one. That level is so ridiculously difficult that it's clearly just there to get people to pay for power-ups, a form of monetization which simply wasn't there in 3. After banging my head against it for a while, I simply stopped playing. The whole franchise feels tainted to me now, a cheap form of grabbing money rather than a legitimate intellectual pursuit. I realize how silly this is, especially for a game whose pinnacle of weaponry is a Super Monkey Storm, but it's fundamental to my motivation for game playing - if the reward for succeeding has lost its meaning, then I don't get a positive Pavlovian response to it, and I lose my interest.

Once a game starts to do freemium monetization, the natural tendency is to simply abandon aspects of gameplay which require greater cognitive skills altogether. The people who are attracted to such things will on principle never pay any money, and the ones who remain - the ones who really just care about getting ahead in the game by any means - will find anything which absolutely requires developing skills overly frustrating, and will likely just get scared off by it. Gameplay descends into repeatedly clicking next, with enough cute animations and just enough having to move the mouse around between clicks to keep the user entertained, and the monetization scheme is that if you pay a buck you can skip ahead a few thousand clicks.

I don't have a good solution to this problem. Freemium is a much more compelling games business model than anything else anybody's come up with, and absent any real alternative, the phenomenon of dumbing down of gameplay will continue.