Dropbox recently clarified (via their blog and privacy policy) that they “de-duplicate” user files. This has been known for quite a while, and is obvious to anyone who’s had a large file “upload” instantly. But how exactly does Dropbox store files? Are they really de-duplicated or just single-instanced? I set out to discover the answer.
Single Instance Storage
It’s fairly simple for a system to eliminate duplicate data by storing only a single instance of multiple identical files. In other words, if you and I both upload “Presentation.pptx” and it’s bit-for-bit identical, it would be a simple matter to store just one copy.
Dropbox definitely does this. I proved it with a simple experiment:
- Create a new 10 MB encrypted disk image in TrueCrypt (so it’ll be 100% unique, random data)
- Move it to the Dropbox folder and wait a few minutes as it uploads
- Copy the file with a new name to the folder and notice that it “uploads” instantly
Dropbox is at least single-instancing storage. This helps users, since it speeds uploads and reduces bandwidth usage. It helps Dropbox in the same way, but goes further since they still “charge” files against your account whether they’re single-instanced or not.
Note that this single-instancing works across users and geographies. I gave a file to a friend to upload to a different Dropbox account, and saw the same “acceleration effect.” This would be quite useful to users and the company for files like iTunes songs which are identical and widespread.
Clashing MD5 Hashes?
A global single-instance storage system sounds great, but it opens the door to hash collision issues. Imagine if you and I both uploaded identical files. Both would have the same “fingerprint” and Dropbox would only store it once. Now imagine instead that, out of coincidence or malice, I uploaded a file with the same fingerprint as yours but different contents. This is not so far-fetched as it seems, and could lead to all sorts of security nightmares.
A common and compromised file checksum method is MD5, so I decided to test how Dropbox handles files of identical size, name, and MD5 hash using the “Nostradamus Attack” PDFs generated by Marc Stevens. My tests show that Dropbox correctly handled the files I tried, and no combination of uploading and naming could force it to incorrectly store the right file. So Dropbox either doesn’t use MD5 or uses a combination of hashing and other mechanisms. Testing other schemes is left as an exercise to the reader.
One more thought: The fact that de-duplication is mentioned in the “privacy” section of the Dropbox policies raises my eyebrows, since it indicates that they see this hash collisions as a matter of privacy rather than data corruption. This indicates that Dropbox is both aware of and susceptible to hash collision attacks generally, though obviously not as simply as creating a bogus MD5 match.
Note: Dropbox is well aware of this issue, having recently squashed an open-source exploit called Dropship!
Sub-File De-Duplication
Data de-duplication is like single-instancing, but it applies to some subset of data. Some enterprise storage systems de-duplicate at multi-megabyte levels, while others are far more granular.
To test whether Dropbox de-duplicates data, I devised a simple experiment:
- Create a new local copy of my existing random TrueCrypt file
- Add a single byte to the end using the “cat” command
- Copy the resulting file to Dropbox
- Watch as Dropbox takes just a few seconds to upload the new file
This test proves that Dropbox does indeed de-duplicate at the sub-file level. Since it took a bit longer to upload that would be expected for a single byte, we can see that Dropbox “chunks” files for hashing and uploading.
De-Duplication Granularity
The next question is just what size chunks or blocks Dropbox uses to de-duplicate data. To test this, I created various blocks of random data using TrueCrypt and experimented to see where the “stair-steps” were in terms of upload time.
My tests used four basic building blocks of 512 KB, 1024 KB, 2048 KB, and 4096 KB in size. Guessing that Dropbox used one of these sizes for their chunking system, I assumed these would quickly demonstrate the answer.
First, I uploaded each file individually and watched as Dropbox took about 30 seconds per MB. This will vary greatly, of course, but the absolute performance doesn’t matter. Only relative performance matters for demonstrating chunking.
Next, I concatenated each file with itself to create a new file twice as large. This would be ideally “chunkable” since it consists of exactly identical data with a nice, clean, evenly-divisible “border”. I uploaded each of these and noticed that the “4096 KB x 2” file uploaded nearly instantly, while all others took the expected amount of time.
I repeated this with “x 3”, “x 4”, and “x 8” files and noticed that the 4096 KB (4 MB) “barrier” was very obvious. Whenever a file contained 4096 KB or less of data Dropbox had seen before, it single-instanced it. Any time it saw a unique “block” smaller than this, it uploaded it fresh.
This proves, at least in the case of my own Mac OS X install of Dropbox, that a 4 MB chunk size is used for de-duplication.
Stephen’s Stance
Dropbox is a very useful service, and I appreciate the technology they use to make it work. By single-instancing storage, the company is able to keep costs and transfer time in check and offer a basic service for free for many users. Despite the recent security issue, I continue to use Dropbox myself and would not hesitate to recommend it. But I do suggest using your own encryption for any sensitive data, as demonstrated in my recent post, Mac Users, Secure Your Stuff in Dropbox.
I remain somewhat concerned about the privacy and security implications of global de-duplication of shared random data. If they use SHA-1 hashes alone, which I suspect, there is a chance that an object will not be stored correctly once 2^80 (or perhaps 2^69 or even 2^52) objects are stored. This would lead to issues of data corruption or inadvertent disclosure. This is a very remote chance indeed, but “birthday problems” like this work against hashing systems. I would love to hear from Dropbox regarding how they prevent this from happening, including disclosure of their methods of hashing data. It’s nice to see the company taking responsibility by disclosing this in their privacy policy, though!
Update: Dropbox apparently does indeed use raw SHA256 hashes to “uniquely” identify data, and this can be exploited in a number of ways.
Magnitudes_are_tricky says
“there is a chance that an object will not be stored correctly once 2^80 (or perhaps 2^69 or even 2^52) objects are stored”
With 4MB chunks, this would only be a problem once Dropbox starts storing close to 2^160 MB = 2^140 TB of data. Lets say they’re able to cram 1TB per cm^3 of volume. They’d need a storage facility 2^140 cm^3 in size. Wiki tells me that the volume of the sun is 1.41 x 10^18 km^3, or 1.41 x 10^39 cm^3. This is roughly 2^130 cm^3.
So Dropbox would need a facility roughly a thousand times the size of the sun to get to the point where collision is even a possibility.
I think the privacy issues of having a single block potentially collide with another one would be somewhat moot at that point.
sfoskett says
You’re assuming blocks are all 4 MB in size, but they’re not. Dropbox stores up to 4 MB in a chunk, but any file or chunk smaller than that is an object as well. So perhaps this might only be five hundred times the size of the sun? Hahaha!
I’m actually more concerned about a birthday attack than a mere coincidence. I can think of a DoS and a data leak I could drive by uploading a file with a hash that matched your file’s…
Varenc says
sha256
Magnitudes_are_tricky says
“I’m actually more concerned about a birthday attack than a mere
coincidence. I can think of a DoS and a data leak I could drive by
uploading a file with a hash that matched your file’s… ”
Then the problem becomes: given a (cryptographic) hash of the target block, find a block that hashes to that hash. This is a basic pre-image attack which is not susceptible to the birthday paradox (because the target is fixed).
http://en.wikipedia.org/wiki/Cryptographic_hash_function
Basically, even if you could find collisions on a hash function using some kind of cryptanalysis combined with the birthday property (e.g., for the attacks on MD5), you’d be restricted to attacks that involve those two specific blocks (which are almost certainly going to be two blocks of random-looking bits, rather than just happening to be something like my bank account number or nuclear launch codes). Either way it’s not really something to be worried about.
jlgaddis says
“So Dropbox either doesn’t use MD5 or uses a combination of hashing and other mechanisms.”
Or perhaps, as is common, they use multiple hashing algorithms (e.g. MD5 and SHA256).
As an example, here are the MD5 and SHA256 hashes for ubuntu-11.04-server-amd64.iso:
MD5: 355ca2417522cb4a77e0295bf45c5cd5
SHA256: 13dc0105f81e47308c5fab5fc6d71c7030fcc2a5721df654cd4c26406da4ffab
What are the chances that an attacker could craft a file that matched?
White_Label_Online_Backup says
Seems like they may be using the chunks to hash, so they aren’t fooled by the Nostradamus PDF, but may be fooled. You’re totally right in that somebody can backsolve that.
In most schemes (we also dedupe in an instance fashion, but our software only dedupes within a single users’ account to completely avoid the above issues) a duplicate hash would result in the link being made to an already existing file with the same hash, and so the consequence would be that the user would see somebody else’s data.
But the very fact that they allow data to be shared between users is problematic for most business customers or those with any privacy concerns: it means that there are many other ways that there can be compromises in security compared to firewalling users’ accounts.
In the end, that means that customers have to encrypt their own data before uploading, which means that dropbox is not so useful for SMBs as they might think.
fregs says
Interesting tests, I was just thinking about how it works, so thanks. Didn’t you try the same with Amazon cloud drive / Google drive / SkyDrive?
young says
Unfortunately, Google drive, N drive(in Korea) and SkyDrive do not support data deduplicaiton. I tested several cloud storage, but only Dropbox supports deduplication service. ^^
curious_soul says
I uploaded a file through one account and later after an hour tried uploading the same file through a different dropbox account and yet both the uploads took the same amount of time. Should the second upload be instantaneous as u claim? What might be going wrong here?Has dropbox changed something recently?