The code monkey's guide to cryptographic hashes for content-based addressing
For the complicated (and common) case in which the data is partly changed and the bandwidth of the hashing function is higher than the bandwidth of the I/O link, the tradeoff point depends on the percentage of data altered and the relative bandwidths of the hash and the I/O (and to a much lesser extent, the size of the hash output). It most cases, there will be a clear advantage in one direction or another after a few minutes of hand calculations - just figure out the hash bandwidth, the I/O bandwidth, and pick a few reasonable values for percentage of data changed. If you want to get fancy, factor in the cost of transferring the hashes over the I/O link.
Measuring bandwidth: Finding the bandwidths of your hash functions and I/O links is easier than ever before. The OpenSSL library has a built-in benchmark command:
$ openssl speed
bonnie++ or a simple cat file > /dev/null will find the bandwidth to your local hard drive, and netperf will report the bandwidth of your network. Be sure to choose the correct block, file, or packet sizes to evaluate as bandwidth varies wildly according to the size of the data element being processed or transferred.
Long-term storage of hashes: CBA is often used for addressing and deduplicating data for storage systems. In this case, the hash values are computed once when the data is first stored and reused many times, so the cost of computing the hashes is amortized over many operations. At the same time, storing the hashes will increase code complexity, reducing one of the benefits of CBA.
Alternatives to evaluate: Sometimes the more mundane methods, such as traditional hash tables, traditional differencing, source control systems, and compression, are overlooked when they would actually perform better than CBA. Computing a very cheap hash and then comparing the full data for anything with a hash collision can be faster than computing an expensive cryptographic hash. In the same vein, another strategy is to use a series of progressively stronger hashes, such as the rolling hash used in the first stages of rsync. Keeping track of the differences between two versions (rather than a fully stateless approach) and sending only those differences is another traditional option. Besides the well known but limited diff tool, xdelta computes efficient differences between files in any format. Compression can change both sides of the equation, depending on the ratio between the "bandwidth" of compression and the I/O bandwidth. Again, a little hand computation will often determine the fastest algorithm quickly.
An illustrative example is rsync, the remote file synchronization program. When synchronizing files between local and remote hosts, rsync uses CBA by default. But when the files to be synchronized are in what looks like a local file system to rsync (such as /home/val/src/), it sends the whole file instead of finding and sending only the changed parts of the file. This heuristic is usually right, but can occasionally be fooled by really high bandwidth networks or really low bandwidth disks. From the man page for rsync:
-W, --whole-file
With this option the incremental rsync algorithm is not used and the whole file is sent as-is instead. The transfer may be faster if this option is used when the bandwidth between the source and destination machines is higher than the bandwidth to disk (especially when the disk is actually a networked filesystem). This is the default when both the source and destination are specified as local paths.
What happens when the chosen cryptographic hash function is broken?
Given that, historically, popular cryptographic hash functions have a useful lifetime of around 10 years, it is only prudent to plan for the eventuality of a successful attack on the hash function. For some applications, the ability to intentionally generate hash collisions has no significant effect. For others, it will result in data corruption or replacement with malicious data.
F-Secure Warns About a Worm Affecting Corporate Networks 2009-01-08 16:42:00+11
Fortinet Cures Mobile Phone “Curse of Silence/CurseSMS” Attack 2009-01-07 16:30:00+11
SEAGATE SHIPS DESKTOP HARD DRIVE WITH WORLD’S HIGHEST AREAL DENSITY – 500GB PER DISK 2009-01-06 15:34:00+11
New FileMaker Pro 10 Ships With Sleek New Interface and Breakthrough Reporting and Automating Features 2009-01-06 12:21:00+11
Lexar extends KODAK offering with Secure Digital High-Capacity, High-Speed Memory Card 2009-01-06 09:36:00+11



