SHA-1 has been widely disparaged for security reasons, but it is just fine for checking file integrity. Like most things in computer science, though, when the files get really large the job takes longer and things start to get complicated. Recently I’ve been working on finding a faster and more efficient way to verify SHA-1 hash values for 8 petabytes of files. The SHA-1 hashes were generated and saved to a database at file creation time. The files are mostly binary, averaging 14 gigabytes each.
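To make the verification job concrete, here is a minimal Python sketch of checking one file against a stored digest. It is an illustration only, not the setup described in this post: the function names, the 8 MiB chunk size, and the idea that the stored value arrives as a hex string are all assumptions.

```python
import hashlib


def sha1_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the hex SHA-1 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte file never has to fit in memory.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare a freshly computed digest with a stored one.

    expected_hex stands in for the value saved at file creation time;
    how it is fetched from the database is left out (hypothetical).
    """
    return sha1_of_file(path) == expected_hex.strip().lower()
```

Calling `verify("hugefile.bin", stored_digest)` returns `True` when the file still matches its stored hash. The chunk size is a tuning knob, not a recommendation.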
At first I focused on hashing speed. Start with the fastest CPUs with the most cores of all the boxes you have. Then get the fastest cracking (uh…hashing) code you can, and compile it very carefully, optimizing it for your platform. This was a natural starting point, since I’ve spent a lot of time over the years playing with John the Ripper, cracking DES, LAN Manager, NTLM, MD5, and other weaklings. In that kind of work, the data is tiny and cutting runtime is the primary goal. To this end, I tried a few different methods of hashing, and eventually found these three to be adequate, listed here in order of how fast they run:
- sphsum, available as C or Java source code, from a project sponsored by the French government.
- sha1sum, available for most OSes and probably already on your Unix or Linux distro. Part of GNU Coreutils.
- another one that I won’t tell you because it would give away the OS we’re using.
After a lot of very careful tweaking of GCC compiler flags on two different CPU platforms, I squeezed the most speed out of sphsum. But once I started calculating hashing throughput in megabytes per second, I began to wonder what my MB/sec was for the file read alone. It turned out that file input was the entire battle: it was the bottleneck regulating the real MB/sec of the overall hashing job.
I was calculating file input MB/sec by just doing this:
time cp hugefile.bin /dev/null
Then I’d just divide the file size (in MB) by the number of seconds returned by the Unix time command. Of course, all of this goes out the window if you’re pulling your files from across a normal Ethernet network. I was pulling from disk. A lot of disk. But not all OSes and CPUs are created equal, and I had to experiment to find the one that could pull from our disk array fastest. Once I’d found a box that could ingest a 14 GB file faster than the hashing could run, I knew I’d reached a reasonable solution. So my next task will be to spread the pain out to multiple boxes and multiple disk arrays. And in case you’re wondering, the “bandwidth” I’m getting for hashing a 14 GB file is averaging 154.15 MB/sec.
If you hash smaller files, you can get speeds up to twice this, as described in this excellent Stack Exchange answer.
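The read-versus-hash comparison above can be sketched in Python as follows. This is a rough illustration, not the measurement method used here (that was simply `time cp`). The function names and chunk size are assumptions, and MB is taken to mean 10**6 bytes.

```python
import hashlib
import time

CHUNK = 8 * 1024 * 1024  # 8 MiB per read; an arbitrary choice for this sketch


def read_mb_per_sec(path, hasher=None):
    """Read a file once and return throughput in MB/sec (1 MB = 10**6 bytes).

    With hasher=None this measures file input only, like `cp file /dev/null`.
    With a hashlib object it measures reading plus hashing.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
            if hasher is not None:
                hasher.update(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / 1e6 / elapsed


def days_to_hash(total_bytes, mb_per_sec):
    """Estimate wall-clock days to hash total_bytes at a steady rate."""
    return total_bytes / (mb_per_sec * 1e6) / 86400
```

Comparing `read_mb_per_sec(path)` with `read_mb_per_sec(path, hashlib.sha1())` shows whether input or hashing is the limit. Use a different file for each run, or a file much larger than RAM, since a repeat read may be served from the OS page cache and look far faster than the disk really is. Under the decimal-unit assumption, `days_to_hash(8e15, 154.15)` comes out at roughly 600 days for 8 PB on a single box, which is why spreading the work across multiple boxes matters.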