Thursday | 8 January, 2009
LinuxWorld.com.au

The code monkey's guide to cryptographic hashes for content-based addressing

Keeping track of large blocks of content using hash values only can be a a way to save bandwidth and build exciting new applications. Follow some important rules to use hashes correctly and securely.
Valerie Henson 28/11/2007 12:09:39

Enter cryptographic hash functions

Fortunately, hash functions that are collision-resistant (simply put, very difficult to find inputs with same output) have already been developed for another purpose - they are called cryptographic hash functions and are used for message authentication, digital signatures, obscuring passwords, and more. As programmers, we can "borrow" these functions and use them for other purposes. So now your algorithm for detecting whether two MP3s are the same is very simple: Calculate and store the cryptographic hash of all your MP3s. When you want to know if your friend's MP3 is new, you ask for its cryptographic hash and then compare it to the hashes of all your other MP3s. If it is the same as any of the existing MP3s, you assume it's the same MP3 and don't bother downloading or storing it. Otherwise, you start the download and kill time photographing the penguins outside the research station.

This, in a nutshell, is what is variously called content-based addressing (CBA), compare-by-hash (CBH), or content-addressed storage (CAS). Just calculate the cryptographic hash of a piece of data (often at a granularity smaller than an entire file) and use the resulting hash value as the address of the data. Besides saving on network bandwidth, CBA eliminates the need for a top-down address assignment framework - everyone calculates the same addresses without having to talk to anyone else - which is useful for peer-to-peer networks like BitTorrent and distributed hash tables. It also simplifies implementation - saving state is optional, since all the addresses can be regenerated from the content itself.

Content-based addressing also can have serious downsides, depending on the application and the state of cryptographical research. First, cryptographic hashes are expensive to compute, so in many applications where the cost to transfer or compare data is low, CBA will only slow down the application. If you wanted to compare MP3s on your local hard drive, more mundane techniques, such as traditional hash tables, would be much faster than CBA in most cases. Second, the collision-resistance of a cryptographic hash function degrades over time - the cryptographic hash function that was collision-resistant when the program was first written will no longer be collision-resistant in a few years, after other cryptographers figure out how to find hash collisions quickly (hence, the term "collision-resistant" rather than "collision-free").

In the rest of this article, we'll first review cryptographic hash functions, examining their origin, computational cost, mathematical properties, and "life cycle" as they transition from newly proposed to reliably collision-resistant to broken. Then we'll explore in detail how to judge when content-based addressing will improve an application, and what effect the breaking of the chosen cryptographic hash function will have on the application. With this information under your belt, you can confidently judge when to use content-based addressing.

Cryptographic hash functions for programmers

In this section, we'll give a quick 'n' dirty overview of the aspects of cryptographic hash functions most relevant to programmers using content-based addressing. A wealth of detail is available on the web these days, including most publications and pre-prints, and of course, there's always Applied Cryptography (although it's woefully out of date when it comes to cryptographic hash functions and many pine for a third edition).

Most programmers are familiar with the idea of cryptographically signing a document - you use an encryption algorithm to generate a signature using the document as input. Other people can verify your signature given the proper key. The problem is that signing is often a very slow and expensive calculation, proportional to the length of the input. Instead of signing the entire document, it would be nicer to sign a short, fixed-size signature of the document. This signature would have to be easy for everyone to calculate given the original document, but at the same time be very hard for anyone to find another document that has the same signature (in technical parlance, second pre-image resistance). It should also be hard to find a document that has a given signature (pre-image resistance), or any pair of documents with the same signature, ever (collision resistance).

Cryptographic hash functions were designed to fill this and similar needs. The major tradeoffs in their design, as visible to a programmer, are:

  • Complexity of calculation: Too simple, and the hash is easily broken. Too complex, and the hash takes too long to calculate.
  • Size of output: Too small, and brute-force attacks are too easy. Too big and the cost of storing and sending the hash value is too large.
Additional Resources
Newsletter Subscription
Sign up for our LinuxWorld newsletters!
RSS Feeds
 
Sponsored Links