The next version of Microsoft Windows Server includes integrated data deduplication technology. Microsoft is positioning this as a boon for server virtualization and claims it has very little performance impact. But how exactly does Microsoft’s deduplication technology work?
Introducing Windows 8 Deduplication
Let’s make one thing clear right from the start: Microsoft started from a clean sheet and invented their own deduplication technology. This is not a licensed, cloned, or copied feature as far as I can tell. There are some clever aspects to it, along with a few head scratchers for folks like me who’ve seen lots of different deduplication approaches.
Microsoft’s deduplication is layered onto NTFS in Windows 8 and will be an add-on feature for Server users. It is implemented as a filter driver on a per-volume basis, with each volume a complete, self-describing unit. It is cluster-aware and fully crash-consistent on all operations. This is a pretty neat trick: as is typical for Microsoft, deduplication will be a simple, transparent feature.
Now let’s talk for a moment about what Windows 8 deduplication is not.
- It is not a client feature: like so many of Microsoft’s storage developments, it is server-only. But perhaps we might see it deployed in low-end or home servers in the future.
- It is not supported on boot or system volumes.
- Although it should work just fine on removable drives, deduplication requires NTFS so you can forget about FAT or exFAT. And of course the connected system must be running a server edition of Windows 8.
- Although deduplication does not work with clustered shared volumes, it is supported in Hyper-V configurations that do not use CSV.
- Finally, deduplication does not function on encrypted files, files with extended attributes, tiny (less than 64 KB) files, or reparse points.
Some Technical Details on Deduplication in Windows 8
Microsoft Research spent two years experimenting with algorithms to find the “cheapest” in terms of overhead. The system selects a chunk size for each data set, typically between 32 KB and 128 KB, though smaller chunks can be created as well; Microsoft claims that most real-world chunks come out at about 80 KB. The system processes all the data looking for “fingerprints” of split points and selects the “best” split on the fly for each file.
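This split-point selection is a form of content-defined chunking. Here is a minimal Python sketch of the idea, not Microsoft’s actual algorithm: a hash over a small sliding window marks “fingerprint” positions, so chunk boundaries follow the data rather than fixed offsets, bounded by minimum and maximum chunk sizes. The window, mask, and size parameters here are illustrative toys.

```python
def _window_hash(buf):
    """Toy hash of the last few bytes; stands in for a real rolling hash."""
    h = 0
    for b in buf:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def chunk(data, window=16, mask=0x3FF, min_size=512, max_size=4096):
    """Split data where the windowed hash hits a fingerprint (h & mask == 0)."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        at_fingerprint = (
            length >= min_size
            and _window_hash(data[i - window + 1:i + 1]) & mask == 0
        )
        if at_fingerprint or length >= max_size:   # cut a chunk here
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                          # trailing remainder
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend on content rather than position, an insertion near the start of a file shifts only nearby chunks; later chunks tend to realign, which is what lets shifted-but-identical data still dedupe.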
After data is deduplicated, Microsoft compresses the chunks and stores them in a special “chunk store” within NTFS. This lives in the System Volume Information folder at the root of the volume, so dedupe is volume-level. The entire setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.
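Conceptually, a volume-level chunk store pairs a table of unique (compressed) chunks with per-file manifests of chunk references; because both live on the same volume, a reader needs nothing external to rehydrate a file. A toy Python model of that layout follows. The names and structures are illustrative, not NTFS on-disk formats, and fixed-size chunking stands in for the real variable-size chunker:

```python
import hashlib
import zlib

class ChunkStore:
    """Toy volume-level store: each unique chunk is compressed and kept once."""
    def __init__(self):
        self.chunks = {}   # sha256 hex digest -> compressed chunk bytes

    def put(self, chunk):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self.chunks:              # dedupe: store only once
            self.chunks[key] = zlib.compress(chunk)
        return key

    def get(self, key):
        return zlib.decompress(self.chunks[key])

def dedupe_file(store, data, size=4096):
    """Replace file data with a manifest of chunk references."""
    return [store.put(data[i:i + size]) for i in range(0, len(data), size)]

def rehydrate(store, manifest):
    """Rebuild the original file from its manifest, using only the store."""
    return b"".join(store.get(k) for k in manifest)
```

Two files that share regions end up sharing entries in `chunks`, while each keeps its own manifest, which is roughly the role the reparse point plays for a deduplicated file.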
There is some redundancy in the system as well. Any chunk that is referenced more than a threshold number of times (100 by default) will be kept in a second location. All data in the filesystem is checksummed and will be proactively repaired, and the same is done for the metadata. The deduplication service includes a scrubbing job as well as a filesystem optimization task to keep everything running smoothly.
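The popular-chunk safeguard can be sketched as a reference counter plus a mirror copy, with a scrub pass that uses checksums to detect and repair a corrupted primary. This is a hedged toy model of the behavior described above, not the actual NTFS implementation:

```python
import hashlib

THRESHOLD = 100   # default from the article: >100 references = keep a 2nd copy

class RedundantStore:
    def __init__(self):
        self.primary = {}   # sha256 hex -> chunk bytes
        self.mirror = {}    # second copies of "hot" chunks
        self.refs = {}      # reference counts per chunk

    def put(self, chunk):
        key = hashlib.sha256(chunk).hexdigest()
        self.primary.setdefault(key, chunk)
        self.refs[key] = self.refs.get(key, 0) + 1
        if self.refs[key] > THRESHOLD:          # hot chunk: mirror it
            self.mirror.setdefault(key, chunk)
        return key

    def scrub(self):
        """Verify each chunk against its checksum; repair from the mirror."""
        repaired = []
        for key, chunk in self.primary.items():
            if hashlib.sha256(chunk).hexdigest() != key:
                if key in self.mirror:
                    self.primary[key] = self.mirror[key]
                    repaired.append(key)
        return repaired
```

The key property: the chunk’s name *is* its checksum, so corruption is detectable without any side data, and the mirror makes the most widely shared chunks survivable.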
Windows 8 deduplication cooperates with other elements of the operating system. The Windows caching layer is dedupe-aware, which should greatly accelerate overall performance. Windows 8 also includes a new “express” library that makes compression “20 times faster”. Already-compressed file types are not re-compressed: ZIP files, Office 2007+ documents, and the like are skipped by the compressor and only deduplicated.
New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. I/O impact therefore ranges between “none and 2x” depending on the operation. Opening a file incurs less than 3% additional I/O and can actually be faster if the data is cached. Copying a large file (e.g., a 10 GB VHD) can make some difference, since deduplication adds additional disk seeks, but multiple concurrent copies that share data can actually improve performance.
Although I am intrigued by Microsoft’s new deduplication technology in Windows 8 server, I still have many questions about its usefulness and impact on performance. Concentrating duplicate data in the system volume makes sense from a technical perspective, but could lead to an I/O hotspot on the disk. This is especially true for external caching storage systems, since there is no integration between Microsoft deduplication and storage array features. I am particularly concerned about the use of deduplication with VHD files in Hyper-V, since it could eat up valuable system RAM and impact I/O performance.
If you would like to try Microsoft deduplication for yourself, I am happy to report that it is included in the developer preview of Windows 8 that is available on Dev Center. Here are a few commands to get you started, and read Rick Vanover’s post too!
Import-Module ServerManager
Add-WindowsFeature -Name FS-Data-Deduplication
Import-Module Deduplication
Enable-DedupVolume E:
Get-DedupVolume
Note: I am a Microsoft MVP and Microsoft briefs me on upcoming technologies under NDA. This post is based on a Microsoft briefing from November which was said at the time not to be covered by any NDA. All of this information could be gleaned by experimenting with the Windows 8 developer preview, but it’s much easier to just go to the source.
Gabriel Chapman says
No OS partition? #fail
Roger Luethy says
Thank you for the good overview of this topic. In general, I think it’s a good move to have deduplication done right in the filesystem (get as near as possible to the source). As you stated, the problem will be how they handle the underlying storage and the performance. Post-processing has proved quite a drawback in other solutions.
Post-processing helps deduplication performance, but I was disappointed to see how long it took for Windows to deduplicate anything on my test machine. I kept trying to make it deduplicate, and it had no effect until it “got around to it” on its own. I think a lot of folks are going to try this and be disappointed when nothing is deduplicated. Then, months later, after complaining, they’ll notice that they’re using much less space.
If the Windows 8 client operating systems supported deduplication but not on the OS partition, it would be worthless, since most clients have just one volume. But I think most server implementations use more than one drive anyway, so I imagine most servers will be able to use deduplication even with that limitation.
Gabriel Chapman says
I see this as a problem for Hyper-V implementations where you are going full throttle with Win8 VMs. Craft all your OS partitions identically, and create separate app/data volumes. Dedupe of the OS partition would allow for some serious space savings and reduced RTO/RPO for backup/recovery. I only need one copy of cmd.exe, not 200 for all 200 servers; now multiply by the roughly 20 GB that is the standard Windows OS install these days.
I believe that you can de-dupe boot volumes of guest VM’s in Hyper-V, actually. As long as you’re not using CSV, which is a pretty big limitation admittedly.
It is disappointing though, since the operating system volume is by definition extremely redundant between machines. It would be nice to be able to de-dupe that as well in Hyper-V implementations
Maybe I am missing the benefit here.
If it is only a NTFS feature – and can only be used with Windows Server 8 – then what would the use case be?
Most enterprises use a dedicated storage array for CIFS shares. I do not know many big shops that use Windows Server as their main file server.
If the blocks are deduped, they are deduped only within that operating system (if I understand correctly) – I gather there is no dedupe across multiple operating systems.
The main problem I see is that it is tied to the OS and not something that can be shared among multiple OSes – which renders it “a buzzword” rather than a useful feature.
Scott Brickey says
I appreciate that MS is looking to provide dedup; I think such an effort is all but necessary for them to stay in the storage space (windows storage server).
That said, the primary reason my file server runs ZFS (on FreeBSD 8, which has no dedup support) is its strong RAID support at the OS level. The server uses a combination of RAID levels depending on the size and relevance of the data, including two- and three-disk mirrored arrays, RAID 5, and RAID 6.
The last time I tried Windows’ software RAID was back in the Windows 2000 days, and I found that my array got corrupted when moving it from one system to another (graceful power-offs and all). That said, several people have suggested that they have had great experiences with it since then.
Have you heard anything about Microsoft’s interest in providing/continuing software RAID or some other type of multi-disk data redundancy (a la drobo or WHS)?
Scott Brickey says
you’d be surprised how often Windows Storage Server is the host OS for your NAS of choice.
Microsoft’s Dynamic Disk technology isn’t bad. It’s actually quite a lot like Symantec Storage Foundation (too much, in fact – Google that!)
Anyway, Dynamic Disk allows you to create mirrored and RAID-5 sets. It’s nowhere near as flexible as Drobo, but it works and is included in Windows.
Then you can layer Dedupe on top of these sets. Or you will be able to whenever Windows 8 Server (whatever it will be called) ships.
Microsoft also introduced something called Storage Spaces which sounds like a sequel to WHS’ flexible RAID but isn’t. I’m not impressed by that. I’ll write it up sometime.
Indeed, this will likely get used by the hordes of DAS-loving Windows servers out there. And there are a multitude of them. Although it’s true that many enterprises have a storage array that probably already does its own dedupe, I’d guess about half of the storage in a datacenter is isolated on Windows servers. And this attacks that very large pool.
Then there’s the artist formerly called Windows Storage Server (now just an add-on for Windows Server). As Scott points out below, it’s everywhere – probably the #1 NAS OS on the planet. And this will work for all that capacity from all those vendors.
A bigger question is whether the Dells and HPs and LaCies of the world WANT deduplication on their Windows-powered “arrays”. After all, it’ll cause customers to buy fewer disk drives, right? Hahahaha!
By default, a file gets deduplicated only if it has not been modified for 30 days or longer and is larger than 64 KB. You can use Set-DedupVolume to change the defaults, and Start-DedupJob to initiate a deduplication process on demand instead of waiting for the default schedule to kick in.
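For illustration, that default eligibility policy (not modified for 30 days, larger than 64 KB) boils down to a simple predicate over file metadata. This Python sketch models only the policy as described here, not the actual filter driver or cmdlet internals:

```python
import os
import time

MIN_SIZE = 64 * 1024        # default: files of 64 KB or less are skipped
MIN_AGE_DAYS = 30           # default: unmodified for at least 30 days

def is_dedup_candidate(path, now=None):
    """Return True if the file meets the default dedup eligibility policy."""
    st = os.stat(path)
    now = time.time() if now is None else now
    age_days = (now - st.st_mtime) / 86400
    return st.st_size > MIN_SIZE and age_days >= MIN_AGE_DAYS
```

This is why a freshly copied test file appears untouched: it fails the age check until the background job eventually processes it.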
Giovanni Coa says
My personal opinion is that it’s always a great challenge trying to enhance technologies, but I personally prefer maximum reliability for data, and the biggest challenge for Microsoft will be to convince IT professionals of the reliability of their dedup solution.
Another option for Microsoft would be to force NTFS to do dedup without giving users the possibility to choose.
At the moment, OS-level mirroring or RAID does not convince me enough to use it instead of a RAID controller with specialized firmware on board.
“The entire setup is self describing, so a deduplication NTFS volume can be read by another server without any external data” – does this mean that if we mount a VHD-based volume (originally deduped on a Windows 8 box) onto an R2 box, the contents of the volume would still be consistent? The reparse points in the file metadata would exist on R2, but I guess the dedupe filter driver is not present. It seems like the deduped contents would still be valid, but subsequent dedupe on R2 would not be possible.