One of the great ironies of storage technology is the inverse relationship between efficiency and security: improving performance or reducing storage requirements almost always reduces the confidentiality, integrity, or availability of a system.
Many of the advances in capacity utilization put into production over the last few years rely on deduplication of data. This key technology has moved from basic compression tools to take on challenges in replication and archiving, and is even moving into primary storage. At the same time, interconnectedness and the digital revolution have made security a greater challenge, with attention turning to encryption and authentication to prevent identity theft or worse crimes. The only problem is that most encryption schemes are incompatible with compression and deduplication of data!
Incompatibility of Encryption and Compression
Consider a basic lossless compression algorithm: we take an input file consisting of binary data and replace all repeating patterns with a unique code. If a file contained the sequence “101110” eight hundred times in a row, we could replace the whole 4,800-bit sequence with a much smaller one that says “repeat this eight hundred times”. In fact, that is exactly what I did (using English) in the previous sentence! This basic concept, called run-length encoding, illustrates the principle behind most modern compression technology.
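To make the idea concrete, here is a quick shell sketch (the file names are my own, and ASCII characters stand in for the bits in the example above): build a file that repeats the same six-character pattern eight hundred times, then watch gzip collapse it to a handful of bytes.

    # Create 800 copies of "101110" back to back (4,800 characters in all).
    yes '101110' | head -n 800 | tr -d '\n' > repeats.txt
    wc -c repeats.txt                     # 4800 bytes of highly repetitive data
    gzip -c repeats.txt > repeats.txt.gz  # compress without touching the original
    wc -c repeats.txt.gz                  # a few dozen bytes: the repetition collapses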
Replace the sequence of identical bits with a larger block of data or an entire file and you have deduplication and single-instance storage! In fact, as compression technology gains access to the underlying data, it can become more and more efficient. The software from Ocarina, for example, actually decompresses JPEG and PDF files before recompressing them, resulting in astonishing capacity gains!
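The same intuition scales up to whole files: if two files (or blocks) hash to the same digest, a single-instance store only needs to keep one physical copy. A toy demonstration, using made-up file names:

    echo "quarterly report, final version" > report-alice.txt
    cp report-alice.txt report-bob.txt
    shasum report-alice.txt report-bob.txt   # identical digests, so store just one copy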
Now let’s look at compression’s secretive cousin, encryption. It’s only a small intellectual leap to use similar ideas to hide the contents of a file rather than just squashing it. But encryption algorithms are constantly under attack, so some very smart minds have come up with incredibly clever methods to hide data. One of the most important advances was public-key cryptography, which uses two different keys: a public key for encrypting data and a private key for decrypting it. The same technique can be used to authenticate identity, since only the designated reader would (in theory) hold the required key.
Cryptography has become exceedingly complicated lately in response to repeated attacks. Most compression and encryption algorithms are deterministic, meaning that identical input always yields identical output. That is unacceptable for strong encryption, since a known-plaintext attack can be combined with the public key to reveal the contents. Much work has therefore focused on eliminating residues of the original data from the encrypted version, as illustrated brilliantly on Wikipedia with the classic Linux “Tux” image. The goal is to make the encrypted data indistinguishable from random noise.
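You can see the problem with deterministic encryption for yourself using openssl (assuming a reasonably recent OpenSSL is installed; the key and file names here are throwaways). In ECB mode the same input always produces the same ciphertext, so patterns survive encryption, while CBC with a fresh random salt and IV does not:

    yes 'AAAAAAAAAAAAAAA' | head -n 1000 > plain.txt
    # Deterministic: two ECB runs with the same key give byte-identical output.
    openssl enc -aes-128-ecb -K 00112233445566778899aabbccddeeff -in plain.txt -out c1.bin
    openssl enc -aes-128-ecb -K 00112233445566778899aabbccddeeff -in plain.txt -out c2.bin
    cmp c1.bin c2.bin && echo "identical ciphertext"
    # Non-deterministic: CBC with a random salt and IV differs on every run.
    openssl enc -aes-128-cbc -k passphrase -in plain.txt -out c3.bin
    openssl enc -aes-128-cbc -k passphrase -in plain.txt -out c4.bin
    cmp c3.bin c4.bin || echo "different ciphertext"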
What happens when we mix these powerful technologies? Deduplication and encryption defeat each other! Deduplication must have access to repeating, deterministic data, and encryption must not allow this to happen. The most common solution (apart from skipping the encryption) is to place the deduplication technology first, allowing it access to the raw data before sending it on to be encrypted. But this leaves the data unprotected longer, and limits the possible locations where encryption technology can be applied. For example, an archive platform would have to encrypt data internally, since many now include deduplication as an integral component.
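In practice, that ordering usually ends up as a pipeline with the data-reduction stage in front of the encryption stage. A minimal sketch, with gzip standing in for whatever deduplication or compression layer applies, openssl standing in for the encryption layer, and a made-up source path:

    tar -cf - /data/archive | gzip | \
        openssl enc -aes-256-cbc -k "passphrase" -out backup.tar.gz.enc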
Why do we prefer compression to encryption? Simply because that’s where the money is! If we can cut down on storage space or WAN bandwidth, we see cost avoidance or even real cost savings! But if we “waste” space by encrypting data, we only save money in the case of a security breach.
A Glimmer of Hope
I had long thought this was an intractable problem, but a glimmer of hope recently presented itself. My hosting provider allows users to back up their files to a special repository using the rsync protocol. This is pretty handy, as you can imagine, but I was concerned about the security of this service. What happens if someone gains access to all of my data by hacking their servers?
At first, I only stored non-sensitive data on the backup site, but this limited its appeal. So I went looking for something that would allow me to encrypt my data before uploading it, and I discovered two interesting concepts: rsyncrypto and gzip-rsyncable.
rsync is a solid protocol, reducing network demands by sending only the changed blocks of a file. But, as noted, compression and encryption tools change the whole output file even if only a tiny bit of the input has been altered. A few years back, the folks behind rsync (who also happen to be the minds behind the Samba CIFS server) developed a patch for gzip that causes it to compress files in independent chunks rather than as a single stream, so a small change to the input only alters the nearby compressed output. This patch, called gzip-rsyncable, hasn’t been merged into the main gzip source even after a dozen years, but it yields amazing results in accelerating rsync transfers.
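Using the patched gzip looks something like this (the host name and paths are placeholders, not a real setup): compress with --rsyncable, push the result with rsync, then edit the source file, recompress the same way, and re-run rsync. The --stats output shows that only a small fraction of the compressed file is resent.

    gzip --rsyncable -c bigfile.doc > bigfile.doc.gz
    rsync -av --stats bigfile.doc.gz backuphost:backups/
    # ...edit bigfile.doc, recompress with the same flags, and rsync again...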
The same technique was then applied to RSA and AES cryptography to create rsyncrypto. This open-source encryption tool makes a simple tweak to the standard CBC encryption scheme (reusing the initialization vector) so that encrypted files can be sent more efficiently over rsync. In fact, it relies on gzip-rsyncable to work its magic. The resulting file is somewhat less secure, of course, but it is probably more than enough to keep a casual snooper at bay.
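Getting started with rsyncrypto means generating an RSA key pair first; the self-signed certificate encrypts the per-file symmetric keys, and only the holder of the private key can decrypt. The invocation below follows the single-file form of the rsyncrypto command as I remember it from its documentation, so double-check the man page of your installed version; all file names are placeholders.

    # Create a private key and self-signed certificate for the backups.
    openssl req -nodes -newkey rsa:2048 -x509 -subj "/CN=backup" \
        -keyout backup.key -out backup.crt
    # Encrypt one file: plaintext, ciphertext, per-file symmetric key, certificate.
    rsyncrypto document.doc document.doc.enc document.doc.symkey backup.crt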
Both of these tools are similar to modern deduplication techniques in that they chop files into smaller, variable-sized blocks before working their magic. And the result is awesome: I modified a single word in a large Word document that I had previously encrypted and stored at the backup site, and rsync transferred just a single block of the new file in an instant rather than the whole file in a few minutes. My only real issue is the lack of integration among these tools: I had to write a bash script to encrypt my files to a temporary directory before rsyncing them. I wish they could be integrated into the main gzip and rsync sources!
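For the curious, here is a minimal sketch of the kind of wrapper script I mean, assuming rsyncrypto’s recursive mode; every path, host name, and key file below is a placeholder rather than my real setup.

    #!/bin/sh
    # Encrypt a source tree into a staging directory, then rsync the result.
    SRC="$HOME/Documents"
    STAGE="$HOME/.backup-staging"
    KEYS="$HOME/.backup-keys"          # per-file symmetric keys live here
    CERT="$HOME/backup.crt"            # public certificate used to encrypt
    REMOTE="user@backuphost:backups/"

    mkdir -p "$STAGE" "$KEYS"
    # Unchanged files encrypt to the same output on the next run, so rsync
    # only has to move the blocks that actually changed.
    rsyncrypto -r "$SRC" "$STAGE" "$KEYS" "$CERT"
    rsync -av --delete "$STAGE"/ "$REMOTE"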
If you are interested in trying out these tools for yourself, and if you use a Mac, you are in luck: MacPorts offers both as simple downloads! Just install MacPorts, type “sudo port install gzip +rsyncable” to install gzip with the --rsyncable flag, then type “sudo port install rsyncrypto” and you’re done! I’ll post more details here if there is interest.
David Slik says
Compression, encryption, de-duplication, and replication can all coexist, you just need to do it in the right order.
You first de-duplicate, then you compress, then you encrypt, and last of all, you replicate.
And, of course, if you really care about your data, you take a hash of it at the beginning so you can verify that, after all those machinations, you’re still getting your original data back at the end of the day.
sfoskett says
That’s the traditional way of doing it. But I’m excited by the idea of doing deduplication AFTER compression and encryption using gzip-rsyncable and rsyncrypto!
Pete Steege says
Hi Stephen,
Encryption seems to be evolving as a multi-level requirement. Encryption of data at the end of the line with self-encrypting drives covers data at rest and avoids the dedupe issue.
It’s trickier, as you say, for encrypting data on the move. When do you see a viable solution being standardized?
Jered Floyd says
As David says, the right way to do this is deduplicate (or, at least, segment for deduplication), compress, encrypt, replicate (wash, rinse, repeat). --rsyncable totally works, but it’s a bit of a hack… it’s doing non-optimal segmentation for deduplication, and of course doesn’t help unless you reset your encryption cipher on the same boundaries as well. As you say, this makes the encryption somewhat less secure — again, you ought to be doing your replication over an encrypted channel anyhow.
At Permabit (http://www.permabit.com), we incorporate all of these technologies in our Enterprise Archive product in this order for maximum benefit. As data is being written, an in-line process breaks files up into variable-sized segments for optimal deduplication. Then these segments are (optionally) compressed, (optionally) encrypted, deduplicated, and written to disk. These compressed, encrypted chunks can then be replicated, which is also done over an encrypted channel to eliminate traffic analysis. This provides the best of all worlds.
Regards,
Jered Floyd
CTO, Permabit
MarkDCampbell says
Stephen: Thanks for this post. A customer called this posting out to me; I reference it several times at my blog and call out the potential problems in ordering. Blog postings regarding this are over at http://www.unitrends.com/weblog/ – the most detailed one is at http://www.unitrends.com/weblog/index.php/2010/04/23/backup-compression-encryption-deduplication-and-replication-solution/
DVD Duplicators says
I agree with what you said: David Slik’s idea is the traditional way of doing it. But is the gzip-rsyncable and rsyncrypto idea workable in practice?
Norealemaill says
What a stupid article. The fact is that better compression allows for better encryption due to unicity distance considerations. It’s common knowledge; read about Claude Shannon.
sfoskett says
It’s enlightening and positive comments like this that make me glad to write and share!
CD Duplication says
Yeah, I agree. It’s a bit trickier encrypting data on the move.