The code monkey's guide to cryptographic hashes for content-based addressing
How hard is "hard?" A good place to start is the brute force "birthday attack" - simply choose inputs randomly, hash them, and see if they collide. This finds a collision more quickly than seems intuitive. If there are 2N possible hashes, the intuitive guess is that hashing half that number of inputs - 2N - 1 - would result in a 50% chance of a collision. In actuality, only approximately the square root of the total number of hashes is needed - 2N/2 - to have a 50% chance of collision. For a concrete example, SHA-1 has 160 bits in its output, for 2160 possible hash values. The birthday attack would take about 280 hash calculations to have a 50% chance of finding a collision. The birthday attack is an upper bound for how "hard" it is to find collisions in a cryptographic hash function. For a good cryptographic hash function, no attacks are known that are significantly faster than the birthday attack.
One practical implication for the programmer is that the number of inputs being compared should be much smaller than the square root of the possible hash values. If the goal is to hash 2N inputs with negligible probability of an accidental collision between any two outputs, then the output of the hash function needs to have at least N*2 bits. This kind of error is not unknown. Several years ago, a programmer at a famous web company proudly explained to me how their web page cache worked. Each page was addressed by the MD5 hash of its contents, truncated to 64 bits (for convenient use of 64-bit data types). Since there were 264 possible hash values and many fewer than 264 web pages, he believed that the hashes would never collide in practice. I had a rough idea how many web pages existed at that time, but to be sure, I asked him how many pages were in the database. The answer: about 4 billion (232) pages, or roughly the square root of the possible hash values. The chance of a collision was actually about 50% when the programmer believed it was negligible. The web has grown since then; if the cache hadn't already suffered a collision, it almost certainly has since then.
Another intuitive error that comes from thinking of the probability of a collision as 1 in (astronomically large number of possible hash values) is the assumption that the chance of collision is completely random and independent. We're trained to think this way in introductory probability classes - just because the last five dice rolls came up sixes doesn't change the probability of the next roll coming up six. But a cryptographic hash function is deterministic - if an input hashes to six the first time, it will hash to six the next time too. The error creeps in when imagining what happens if you do have a collision - okay, there was a collision, but it was incredibly unlikely in the first place, and the chances of it happening twice are even more unlikely. This isn't true; once a collision, always a collision for that data set (unless you can change your hash function).
The life cycle of a cryptographic hash function
Enough cryptographic hash functions have evolved from state of the art to completely broken that some general trends in the life cycle of a cryptographic hash function have become clear.
In more detail, a cryptographic hash function begins life as an initial proposal. At this point, few cryptographers have examined it, and it is considered unreliable; many proposed cryptographic hash functions are cut down in their infancy, broken by another cryptographer soon after they are proposed. As a cryptographic hash function is reviewed and tested by more cryptographers without weaknesses being found, early adopters begin using it in practice and more attention is focused on it. If the hash survives the increased scrutiny that comes along with growing popularity, it enjoys a period of maturity and trust for some years. However, the same popularity that creates confidence in a cryptographic hash also motivates researchers to work on breaking the function. After a few years, a weakness is discovered that allows an attack of a few orders of magnitude less complexity than a brute force attack. At this point, no practical attack exists, but experts view it as a warning sign that a stronger attack is imminent, and begin recommending moving away from the previously pristine hash function. One of the more extraordinary instances of this occurred in 1995, when the NSA recommended moving away from SHA-0 (then known as just plain SHA) and to the slightly modified SHA-1, without explaining exactly why. (SHA-0 was fully broken in 2004.) Around this time, non-experts scoff and compare the complexity of the attack to the number of atoms in the universe or some other impressive physical quantity.
F-Secure Warns About a Worm Affecting Corporate Networks 2009-01-08 16:42:00+11
Fortinet Cures Mobile Phone “Curse of Silence/CurseSMS” Attack 2009-01-07 16:30:00+11
SEAGATE SHIPS DESKTOP HARD DRIVE WITH WORLD’S HIGHEST AREAL DENSITY – 500GB PER DISK 2009-01-06 15:34:00+11
New FileMaker Pro 10 Ships With Sleek New Interface and Breakthrough Reporting and Automating Features 2009-01-06 12:21:00+11
Lexar extends KODAK offering with Secure Digital High-Capacity, High-Speed Memory Card 2009-01-06 09:36:00+11



