ZFS should have been great, but I kind of hate it: ZFS seems to be trapped in the past, before it was sidelined as the cool storage project of choice; it’s inflexible; it lacks modern flash integration; and it’s not directly supported by most operating systems. But I put all my valuable data on ZFS because it simply offers the best level of data protection in a small office/home office (SOHO) environment. Here’s why.
The ZFS Revolution, Circa 2006
In my posts on FreeNAS, I emphatically state that “ZFS is the best filesystem”, but if you follow me on social media, it’s clear that I don’t really love it. I figured this needs some explanation and context, so at the risk of agitating the ZFS fanatics, let’s do it.
When ZFS first appeared in 2005, it was absolutely with the times, but it’s remained stuck there ever since. The ZFS engineers did a lot right when they combined the best features of a volume manager with a “zettabyte-scale” filesystem in Solaris 10:
- ZFS achieves the kind of scalability every modern filesystem should have, with few limits in terms of data or metadata count and volume or file size.
- ZFS includes checksumming of all data and metadata to detect corruption, an absolutely essential feature for long-term large-scale storage.
- When ZFS detects an error, it can automatically reconstruct data from mirrors, parity, or alternate locations.
- Mirroring and multiple-parity “RAID Z” are built in, combining multiple physical media devices seamlessly into a logical volume.
- ZFS includes robust snapshot and mirror capabilities, including the ability to update the data on other volumes incrementally.
- Data can be compressed on the fly, and deduplication is supported as well (see the sketch after this list).
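For readers who have never touched it, here’s a minimal sketch of what those features look like on a current OpenZFS system (pool, dataset, and device names are illustrative):

```sh
# Create a pool named "tank" from two mirrored disks; checksumming is on by default.
zpool create tank mirror /dev/sda /dev/sdb

# Create a dataset with on-the-fly compression enabled.
zfs create -o compression=lz4 tank/photos

# Snapshot, replicate in full once, then send only the changes next time.
# (Assumes a second pool named "backup" already exists.)
zfs snapshot tank/photos@monday
zfs send tank/photos@monday | zfs receive backup/photos
zfs snapshot tank/photos@tuesday
zfs send -i @monday tank/photos@tuesday | zfs receive backup/photos
```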
When ZFS appeared, it was a revolution compared to older volume managers and filesystems. And Sun open-sourced most of ZFS, allowing it to be ported to other operating systems. The darling of the industry, ZFS quickly appeared on Linux and FreeBSD, and Apple even began work to incorporate it as the next-generation filesystem for Mac OS X! The future seemed bright indeed!
Checksums for user data are essential, or you will lose data: see Why Big Disk Drives Require Data Integrity Checking and The Prime Directive of Storage: Do Not Lose Data
2007 to 2010: ZFS is Derailed
But something terrible happened to ZFS on the way to its coronation: Lawsuits, licensing issues, and FUD.
The skies first darkened in 2007, as NetApp sued Sun, claiming that their WAFL patents were infringed by ZFS. Sun counter-sued later that year, and the legal issues dragged on. Although ZFS definitely did not copy code from NetApp, the copy-on-write approach to snapshots was similar to WAFL, and those of us in the industry grew concerned that the NetApp suit could impact the future availability of open-source ZFS. And this appears to have been concerning enough to Apple that they dropped ZFS support from Mac OS X 10.6 “Snow Leopard” just before it was released.
Here’s a great blog about ZFS and Apple from Adam Leventhal, who worked on it: ZFS: Apple’s New Filesystem That Wasn’t
By then, Sun was hitting hard times and Oracle swooped in to purchase the company. This sowed further doubt about the future of ZFS, since Oracle did not enjoy wide support from open source advocates. And the CDDL license Sun applied to the ZFS code was judged incompatible with the GPLv2 that covers Linux, making it a non-starter for inclusion in the world’s server operating system.
Although OpenSolaris continued after the Oracle acquisition, and FreeBSD embraced ZFS, this was pretty much the extent of its impact outside the enterprise. Sure, NexentaStor and GreenBytes helped push ZFS forward in the enterprise, but Oracle’s lackluster commitment to Sun in the datacenter started having an impact.
What’s Wrong With ZFS Today
OpenZFS remains little-changed from what we had a decade ago.
Many remain skeptical of deduplication, which hogs expensive RAM in the best-case scenario. And I do mean expensive: Pretty much every ZFS FAQ flatly declares that ECC RAM is a must-have and 8 GB is the bare minimum. In my own experience with FreeNAS, 32 GB is a nice amount for an active small ZFS server, and this costs $200-$300 even at today’s prices.
And ZFS never really adapted to today’s world of widely-available flash storage: Although flash can be used to support the ZIL and L2ARC caches, these are of dubious value in a system with sufficient RAM, and ZFS has no true hybrid storage capability. It’s laughable that the ZFS documentation obsesses over a few GB of SLC flash when multi-TB 3D NAND drives are on the market. And no one is talking about NVMe even though it’s everywhere in performance PCs.
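For context, flash attaches to a ZFS pool today as a separate log (SLOG) device and a separate cache (L2ARC) device rather than being pooled together with the disks. A minimal sketch, with illustrative device names:

```sh
# Small, high-endurance SSD partition as the ZFS intent log (SLOG); only helps synchronous writes.
zpool add tank log /dev/nvme0n1p1

# Larger SSD as a read cache (L2ARC); on OpenZFS its contents are lost at reboot.
zpool add tank cache /dev/sdc

# The log and cache devices show up as their own sections in the pool layout.
zpool status tank
```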
Then there’s the question of flexibility, or lack thereof. Once you build a ZFS volume, it’s pretty much fixed for life. There are only three ways to expand a storage pool:
- Replace each and every drive in the pool with a larger one (which is great but limiting and expensive)
- Add a stripe on another set of drives (which can lead to imbalanced performance and redundancy and a whole world of potential stupid stuff)
- Build a new pool and “zfs send” your datasets to it (which is what I do, even though it’s kind of tricky; see the sketch below)
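The third option is the only one that also lets you change protection levels or shrink, and it amounts to wholesale replication. A hedged sketch of that migration, with illustrative pool names and devices:

```sh
# Build the new pool with the layout you actually want, e.g. six drives in RAID-Z2.
zpool create tank2 raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Snapshot everything recursively and replicate the whole hierarchy to the new pool.
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F tank2

# After verifying the copy, retire the old pool.
zpool destroy tank
```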
Apart from option 3 above, you can’t shrink a ZFS pool. Worse, you can’t change the data protection type without rebuilding the pool, and this includes adding a second or third parity drive. The FreeNAS faithful spend an inordinate amount of time trying to talk new users out of using RAID-Z1 [1] and moaning when they choose to use it anyway.
These may sound like little, niggling concerns, but they combine to make ZFS feel like something from the dark ages after using Drobo, Synology, or today’s cloud storage systems. With ZFS, it’s “buy some disks and a lot of RAM, build a RAID set, and never touch it again”, which is not exactly in line with how storage is used these days. [2]
Where Are the Options?
I’ve probably made ZFS sound pretty unappealing right about now. It was revolutionary but now it’s startlingly limiting and out of touch with the present solid-state-dominated storage world. So what are your other choices?
Linux has a few decent volume managers and filesystems, and most folks use a combination of LVM or MD and ext4. Btrfs really got storage nerds excited, appearing to be a ZFS-like combination of volume manager and filesystem with added flexibility, picking up where ReiserFS flopped. And Btrfs might just become “the ZFS of Linux”, but development has faltered lately, with a scary data loss bug derailing RAID 5 and 6 last year and not much heard since. Still, I suspect that I’ll be recommending Btrfs for Linux users five years from now, especially with strong potential in containerized systems. [3]
On the Windows side, Microsoft is busy rolling out their own next-generation filesystem. ReFS uses B+ trees (similar to Btrfs), scales like crazy, and has built-in resilience and data protection features. [4] When combined with Storage Spaces, Microsoft has a viable next-generation storage layer for Windows Server that can even use SSD and 3D XPoint as a tier or cache.
Then there’s Apple, which reportedly rebooted their next-generation storage layer a few times before coming up with APFS, launched this year in macOS High Sierra. APFS looks a lot like Btrfs and ReFS, though implemented completely differently with more of a client focus. Although lacking in a few areas (user data is not checksummed and compression is not supported), APFS is the filesystem iOS and macOS need. And APFS is the final nail in the coffin for the “ZFS on Mac OS X” crowd.
Each major operating system now has a next-generation filesystem (and volume manager): Linux has Btrfs, Windows has ReFS and Storage Spaces, and macOS has APFS. FreeBSD seems content with ZFS, but that’s a small corner of the datacenter. And enterprise storage systems, including the enterprise-class offerings built on ZFS from Sun, Nexenta, and iXsystems, have already moved way past what stock ZFS can do.
Still, ZFS is way better than legacy SOHO filesystems. The lack of integrity checking, redundancy, and error recovery makes NTFS (Windows), HFS+ (macOS), and ext3/4 (Linux) wholly inappropriate for use as a long-term storage platform. And even ReFS and APFS, lacking data integrity checking for user data, aren’t appropriate where data loss cannot be tolerated.
Stephen’s Stance: Use ZFS (For Now)
Sad as it makes me, as of 2017, ZFS is the best filesystem for long-term, large-scale data storage. Although it can be a pain to use (except in FreeBSD, Solaris, and purpose-built appliances), the robust and proven ZFS filesystem is the only trustworthy place for data outside enterprise storage systems. After all, reliably storing data is the only thing a storage system really has to do. All my important data goes on ZFS, from photos to music and movies to office files. It’s going to be a long time before I trust anything other than ZFS!
[1] RAID-Z2 and RAID-Z3, with more redundancy, are preferred for today’s large disks to avoid data loss during a rebuild.
[2] Strangely, although multiple pools and removable drives work perfectly well with ZFS, almost no one talks about using it that way. It’s always a single pool named “tank” that includes every drive in the system.
[3] One thing really lacking in Btrfs is support for flash, and especially hybrid storage. But I’d rather that they got RAID-6 right first.
[4] Though data checksums are still turned off by default in ReFS.
Paul Corneille says
For small/medium business, ZFS is THE solution. For years now we have been saving several TB of data every month: video footage, VFX RAW images, and other heavy multimedia files, on FreeBSD servers with ZFS in 2 different physical locations, and we have never lost one bit of data. When a HD fails, we recover all the data very swiftly.
Scott Armitage says
My money is on bcachefs (not mentioned in the article). It’s a clean-sheet design, and Kent is taking his time to make sure the codebase is rock solid before he merges it upstream. Its features are a superset of ZFS, Btrfs, and caching devices (and then some), and if Kent keeps up the good work it should have the reliability of ZFS.
Satadru Pramanik says
In fairness, there is an OpenZFS on macOS community, and they’ve even gotten macOS booting on ZFS with some work, so if one is trying to choose between a new, undocumented APFS and HFS+, there may be another option.
See https://openzfsonosx.org/
jp says
It depends on the use case, honestly. If all you are doing is a media server or something, then you can go with duplicated DrivePool, or DrivePool and SnapRAID. Or if you are a Linux person, mergerfs and SnapRAID. Good enough for most people with a good local and offsite backup, and a lot simpler than ZFS.
Mattia_98 says
I can confirm that ECC RAM can come in very handy. I have lost a pretty good amount of data because the RAM in my NAS went bad. First ZFS reported a few checksum failures, but nothing too bad. Then it started to report that files had been unrecoverably damaged. I swapped the sticks and now it seems to work OK again. Sadly, I don’t have the money for another backup NAS, so I need to take as much care as possible of the one I have ;D
sfoskett says
Exactly. I wouldn’t trust my data to anything other than ZFS in the small office marketplace. I was a Drobo fan but they lost me (slow and expensive) and I’m definitely not going to trust Btrfs for a while to come!
sfoskett says
Thanks for the pointer! I had not heard of bcachefs but now I’ll look it up!
sfoskett says
I disagree. Today’s large hard drives absolutely need data checksumming to avoid unrecoverable bit error issues. That’s why I’m not a fan of ReFS or APFS at this point!
sfoskett says
Yes! Without ECC RAM you might as well store without data checksums because you just don’t know what you’re going to get out. Happily, lots of Intel CPUs (even cheap ones) support ECC now with the right motherboard, so although the memory is expensive the CPU doesn’t have to be.
Conan Kudo (ニール・ゴンパ) says
Good news is that there’s been a huge focus by Btrfs developers this year to fix all the remaining issues with the file system in regards to RAID and repair. The first iteration of this stuff was released with the 4.12 kernel (resolving the major RAID 5/6 issues), and further iterations are coming over the next few kernel releases.
Btrfs also has a working implementation for Windows, and an in-progress implementation for Haiku OS.
Yan Minari says
Bonus points for having compression and checksum.
sfoskett says
This is really good news. I am impressed by what Btrfs has promised and hope to see it become “the ZFS of Linux”. I’ll be watching!
Kenny says
> multiple pools and removable drives
This is exactly how I’m using ZFS on my two (puny) storage servers. It works great!
npcomplete says
I was hopeful ever since I heard about ReFS years ago, but it still doesn’t look promising, even with all the Storage Spaces features enabled that would make it equivalent to ZFS. I’ve come across many complaints about performance and, worse, reliability issues when something goes wrong (power outage/sudden reset/OS panic/disk failure/replacement).
I’m also hoping Btrfs improves quickly. I don’t trust it now, and I don’t trust Netgear’s use of Btrfs on top of mdraid (because they too don’t trust the RAID 5).
jp says
But you don’t necessarily need real-time checksumming. SnapRAID will do checksumming regularly, and that is good enough for a lot of use cases.
George Michaelson says
It never addressed GlusterFS or Ceph. It’s single-node only. The NFS export options are frankly confusing and don’t get explained (i.e., why not have /etc/exports instead, since it seems you can need both).
Tuning a million badly understood options. Beginners read cluesheets and are led up the garden path. Suddenly your entire 500GB-memory box is thrashing and you don’t know why.
zfs send|receive pipes are the best thing eva™, but… why the lack of a network transport? SSH imposes costs which make it delay-bound. mbuffer as a transport is only documented by hearsay on blogs.
Recovery from stupid things™ like a shutdown without unmount/export and then a new OS… it’s bizarre that you can use zfs/zpool commands to say “go find it”, but it’s treated like a big secret. IT’S ON THE DISKS, MAN.
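For what it’s worth, the “go find it” path is just zpool import reading the labels off the disks; a hedged sketch, with an illustrative pool name:

```sh
# Scan attached devices for importable pools and show what was found.
zpool import

# Import a pool that was never cleanly exported on its previous host.
zpool import -f tank

# If the devices moved, point the scan at a specific device directory.
zpool import -d /dev -f tank
```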
Adam Baxter says
Can you elaborate? I’d like to plug in 2x 8TB removable drives without risking my main pool.
Rob says
Apple has promised to release an official APFS spec, though they didn’t indicate when it should be expected. My best guess would be about a year from now, as a sort of “version 2”, similar to what they did with Swift.
Rob says
There are even a few Atoms that support ECC.
Rob says
Apple’s excuse:
> To protect data from hardware errors, all Flash/SSD and hard disk drives used in Apple products use Error Correcting Code (ECC). ECC checks for transmission errors, and when necessary, corrects on the fly.
What I’m not sure of is what proportion of errors that will prevent.
Jesper Monsted says
Just create a second pool and mount it wherever you want. There’s nothing magic about it.
My setup is a fast and safe (raidz2 on many drives) pool for active stuff and a slower and less safe (raidz1 on large drives but backed up) pool for archival purposes.
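In practice that looks something like the following sketch (pool and device names are illustrative, and the removable drives never touch the main pool):

```sh
# Mirror two removable drives as their own pool with a dedicated mount point.
zpool create -m /mnt/offsite offsite mirror /dev/sdx /dev/sdy

# Replicate a dataset to it, then detach cleanly before unplugging.
zfs snapshot tank/photos@weekly
zfs send tank/photos@weekly | zfs receive offsite/photos
zpool export offsite

# Later, plug the drives back in and re-import the pool.
zpool import offsite
```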
Max Mortillaro says
You had me at Haiku OS.
Kim ALLAMANDOLA says
Mh, IMVHO today we need a “modern data storage solution” like ZFS: in operations we need to send & receive entire deployments (BE/Boot Environments), we need zones (not crappy lxc/d), we need modern package managers (able to integrate with the underlying storage better than traditional POSIX filesystem access), and in turn modern installers (for instance, to easily deploy a set of remote machines without the nightmare of preseed, the spaghetti code of FAI, or the iron-bar flexibility of LinuxCOE).
ZFS for me was and still is a pioneer; unfortunately, the era of “operations tech” from the old UNIX companies is lost in the past, and today nobody seems able to really evolve it.
Sorry for my poor English.
ksec says
I think APFS really is a consumer-focused, NAND-flash-only file system, where error correction on the NAND die itself, as well as in the controller, replaces what was done on the software side.
But HDD? I wouldn’t want APFS on an HDD if I care about data reliability.
Rob says
Yeah, it’s absolutely not intended for enterprise applications. I doubt you’d be able to get ZFS to run on a wristwatch. 😀
Juana de Arco en Waterloo says
“It’s laughable that the ZFS documentation obsesses over a few GB of SLC flash when multi-TB 3D NAND drives are on the market”
Because SLC flash is more reliable, faster, and provides more endurance. Preferring a few grams of pure gold to a ton of lead is not laughable. If you are talking about the cache and ZIL, then SLC is obviously the best option. Of course, MLC drives are also a great option and much more cost effective.
sfoskett says
My point is not that SLC is useless but that ZFS has no way to create a real hybrid pool with the cheap and abundant large SSDs available today. Personally, I have a flash pool and a disk pool and manually place data on each depending on my expectations of performance vs. capacity. This is a decidedly old-school approach to storage management. I would love to see something better implemented in ZFS in the future, such as true hybrid pools or perhaps just a better caching mechanism than L2ARC.
Juana de Arco en Waterloo says
Actually, with Apple Fusion I prefer fission. That hybrid cache mechanism is not as smart as I wish, and I found that manual management of files can be more efficient, at least for me. But you are right, and I get your point: L2ARC is not useful for the kind of smart hybrid cache mechanism that you want. An implementation of persistent L2ARC is coming (someday), which somewhat approaches the hybrid smart cache solution. However, bear in mind that hybrid solutions are also prone to more failures, similar to RAID-0: if one device fails, everything is lost. You don’t get that risk with L2ARC and a dedicated ZIL, and yet performance can be improved significantly. Another solution is to replace all the disks with flash, since large SSDs are cheap and abundant. Eventually our disks will be obsolete anyway.
sfoskett says
I see no reason that hybrid has to impact availability any more than any other striped pool. We should be able to add redundant SSDs (a mirror or RAIDZ) to match the HDDs and have it move data between them. This is how enterprise storage systems do it. Wouldn’t it be good? I definitely would not add a single SSD any more than I would add a single HDD (which goes to my point about flexibility above).
Juana de Arco en Waterloo says
Well, ZIL and L2ARC have nothing to do with availability. In fact, they don’t touch issues such as redundancy and availability. They are only concerned with performance. Which is a good thing. I mentioned Apple Fusion (striped hybrid) precisely to highlight the fact that it touches not only performance but also failure rate, which would make other issues even more relevant, such as backups and mirrors. That only complicates things even further for data integrity. So, what if I only want to touch performance? To me it looks like I get more value with a dedicated flash drive for the ZIL and L2ARC, while at the same time retaining other options for availability and redundancy if I want them. How can I get that feature with another filesystem?
Bart Smaalders says
Also, new AMD Ryzen CPUs support ECC w/ the right mobos.
Bart Smaalders says
If you don’t checksum the data when you read it off the drive and compare it to the checksum in the block ptr, you’ll drive on w/ bad data. Not necessarily a problem for your home video server… but it could be painful if you’re printing checks, or doing something else where errors matter.
jp says
Agreed, there are use cases that benefit tremendously from ZFS, but other use cases get relatively little benefit and are constrained by its inflexibility and comparative complexity. This makes ZFS largely impractical nearly anywhere without a full-time IT guy. Of course, most people who would even consider it outside of business are hobbyists, and we aren’t exactly known for being practical.
Paul Vixie says
“And no one is talking about NVMe …”
huh. i was just talking about NVME and tiered hierarchical storage to make appropriate use of solid state disks in ZFS, about a week ago. which would be meaningless since i am not a ZFS developer, except, the person i was talking to was marshall kirk mckusick, original author of FFS/UFS, now a freebsd kernel dev, and he had a Gleam in his Eye. i want my spinning rust gone and my power bills lowered. when NVME offers memory-mappable persistent storage, i want my file system to take advantage of it. mmap should mean no block-level data moves.
bcachefs sounds like it may be the cool thing for linux, but i’m not expecting to run anything other than freebsd and bhyve at my physical layer, any decade soon. i’m exceptionally pleased and proud that freebsd went all-in on ZFS, and i am looking forward to whatever they come up with next.
by the way, dedup on ZFS on freebsd is a disaster, should be removed from the system altogether, it will crap on your metadata, you will have to use bit tweezers to recover your data. Just Say No to dedup!
Tim says
I’m a consistent user of mbuffer for zfs send/receives and it works great. I’ve experimented with glusterfs nodes using ZFS storage for the backends and I’ve had zero issues. Though glusterfs always feels um.. rinky dink.
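For anyone curious, the mbuffer transport that only gets documented on blogs looks roughly like this (a hedged sketch; hosts, ports, and buffer sizes are illustrative):

```sh
# On the receiving host: listen on a TCP port and feed the stream into zfs receive.
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F backup/tank

# On the sending host: pipe the replication stream straight to the receiver, bypassing SSH.
zfs send -R tank@weekly | mbuffer -s 128k -m 1G -O receiver.example.com:9090
```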
Hugh Briss says
ZFS is way over-hyped and its fanatical worshippers need to go away. The whole “ZFS has data CRCs so it preserves your data integrity!” thing that is repeated in every “why ZFS?” post everywhere only shows the gross technical ignorance of the people using ZFS. Hard drives already store and verify all data with ECC. This is how the “reallocated sectors” and “uncorrectable errors” mechanisms work. All modern hard drives provide better protection against bit rot than ZFS, so ZFS CRCs are just more redundant garbage.
OK, that’s not 100% true. There can be errors in RAM, physical data layers like SATA and Ethernet PHYs (both of which have minimal error detection checksums), and any other component that is responsible for holding the data temporarily at any point. In such an extremely rare case, ZFS stands a chance of detecting the error. Unfortunately, unless you’re using ZFS with a RAID implementation that ZFS is friendly with, detecting the error only means you know what you’ve lost. Without all of those specific conditions being met, the integrity arguments in favor of ZFS are useless. Most people aren’t using ZFS in a configuration that enables data recovery when a data CRC error is found.
As far as raw Linux filesystems go, XFS is superior to all other available options. The data loss bugs that used to scare people off were fixed ages ago. XFS even has the same block-level deduplication as BTRFS now. With every month that passes ZFS becomes even less relevant. I don’t understand why anyone uses it when the available (and natively supported) alternatives are clearly superior.
anon says
ZFS on Solaris 11 already supports persistent L2ARC and works perfectly fine with large NVMe drives. It also stores raw blocks, so if you have compression enabled then the entire L2ARC is compressed as well.
underoverlay says
I use XFS on OSs that support it. Why?
– maturity
– support in stock kernels
– consistent performance
– overall performance is plenty good enough
– relatively easy to understand (compared to e.g. ZFS)
see https://en.wikipedia.org/wiki/XFS for more.
HAMMER2 has great potential, but porting efforts are scarce…
Keeping an eye on CEPH’s interesting scaling potential these days as well…
Allan Jude says
The SLC devices are mentioned for their write endurance. You don’t need a large SLOG; the extra capacity will never be used.
A hybrid-ish system is coming; see Intel’s work on Allocation Classes.
TimK says
BBCP and GridFTP make good backend transports for ZFS send/receive, too.
LD says
You mentioned that ReFS is not appropriate (as in “trust your data to it”) since it is lacking data integrity checking. I never tried ReFS, but a quick search found this: https://docs.microsoft.com/en-us/windows-server/storage/refs/integrity-streams
In short, it appears you can “turn on” integrity checks on ReFS on file/folder or volume level. It will have some impact on performance, but I bet integrity checks also have performance impacts on btrfs/ZFS etc.
Michael Rose says
You are so full of shit your eyes ought to be brown. Data corruption still happens, and it’s still useful to be able to know which copy of a block is garbage and replace it with the one that passes its checksum, or at least restore the relevant file or files from backup.
Hugh Briss says
You are welcome to your uninformed opinion. I operate in the world of facts. Are you using RAID-Z? If not then you literally cannot “replace a garbage block with one that passes checksum.” Where does the data corruption happen? How often does it happen? The only place I have seen anyone mention corruption happening that makes sense is a defective RAM chip on a hard drive circuit board. If your drive is corrupting data in this way, it’ll corrupt the filesystem itself too.
I’m sorry that you don’t really understand what you’re talking about.
Kil says
Yep. Right now, XFS on top of dmraid is still THE way to go. I have been running large, stable production XFS systems this way for over 15 years.
I do have a number of BTRFS systems on top of dmraid (NOT BTRFS RAID, it sucks) and they have been mostly stable but very poor performing, both in terms of speed and space efficiency (especially fragmentation for iSCSI and virtual machine files). The only reason I ever used BTRFS was for its ability to make lightweight copies. Now that XFS supports extent-same, there is no reason to use BTRFS any more. Which is a good thing, because BTRFS continues to make me nervous with so many gotchas lurking. Maybe one day it will be stable, but that won’t make it perform better.
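For reference, the XFS extent-sharing mentioned above looks roughly like this (a hedged sketch; it assumes a filesystem created with reflink enabled and a reasonably recent kernel):

```sh
# Reflink support is enabled at mkfs time on XFS.
mkfs.xfs -m reflink=1 /dev/sdb1

# Lightweight copy: the new file shares extents until either copy is modified.
cp --reflink=always vm-image.qcow2 vm-image-clone.qcow2

# Batch deduplication of existing files uses the same extent-sharing machinery.
duperemove -dr /mnt/data
```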
That bcachefs mentioned above looks interesting, I’m going to take a look at it.
Badtux says
I’m not sure that tiered hierarchical storage is going to be a “thing” in the future, now that 2TB SSDs are shipping for $550 from Crucial and will likely be coming down significantly in price over the next few years (I remember when 500GB SSDs were $550; now, only four years later, you can get them for $150). All-flash storage systems are still the exception rather than the rule, but I doubt that will be the case four years from now.
BTRFS has been under continuous development for ten years. By the time ZFS had been under development for ten years, it was rock solid stable and the default filesystem for Solaris and widely used under FreeBSD. But I regularly see changesets go across the git wire for BTRFS that make me say, “what the (bleep)?!”, fixing stupid things that should have never made it into the codebase in the first place. Part of the problem is that they’re going through a Linux buffer cache layer that was never designed for something like BTRFS, while ZFS has its own buffer cache layer that was designed from the beginning to support ZFS. A lot of the problems seem to be bad interactions between BTRFS, write barriers, the generic Linux buffer cache layer, and the block driver layer underneath the buffer cache layer. The Linux guys are trying to support every filesystem under the sun with a single buffer cache layer, so BTRFS relies on a set of hacks to that layer, but hacks on top of hacks is no way to get reliability. Another problem is the same problem that ReiserFS faced — btrees are a lousy way to deal with unreliable storage media unless you’re including a lot of redundancy allowing the trees to be rebuilt if they get smashed. ReiserFS was infamous for eating filesystems. So is BTRFS. Coincidence? I suspect not.
A shame. I just put together a new storage system. I would have loved to have used BTRFS. It fits into the Linux environment much better than ZFS (for obvious reasons), and it’s much easier to manage a pool of BTRFS drives. But it just isn’t stable. It isn’t. And that annoys the bleep out of me.
Néstor C. says
If the problem is in the metadata, there are 2 copies at far-apart positions on the same disk.
By the way, I’m using raidz. ECC memory is not mandatory, but it is recommendable, and not only for ZFS.
Chris Collins says
I think the killer with ReFS is that it doesn’t support compression, which is a bizarre design choice; CPU cycles are more readily available than I/O today. Not to mention the space compression saves as well. The reason ZFS is loved by so many is that it is light years ahead of other free filesystems. Gone are the days when the only thing that mattered was performance; now the ability to snapshot data, automatic checksumming, dynamic resizing of pools, and data integrity are all more important than performance.
Alex Ellis says
Good blog post Stephen – I have an HP ProLiant MicroServer N40L which only has 2GB of RAM – it will technically run ZFS, but that goes against the common advice of 1GB RAM per 1TB of storage. The CPU constrains anything like SCP, but FTP achieves high speed and iperf full gigabit speed. Copying an ISO up over SFTP maxes out the CPU, which probably shows this box needs retiring.
Hugh Briss says
You claim they’re “garbage” yet you have still failed to address a single one of them. How convenient.
“Modern drives make extensive use of error correction codes (ECCs), particularly Reed–Solomon error correction…Only a tiny fraction of the detected errors ends up as not correctable.” And covering the ONLY case that ZFS CRCs are usable for: “The worst type of errors are silent data corruptions which are errors undetected by the disk firmware or the host operating system.” https://en.wikipedia.org/wiki/Hard_disk_drive#Error_rates_and_handling
“Many processors use error correction codes in the on-chip cache” https://en.wikipedia.org/wiki/ECC_memory#Cache
“It automatically repairs the damage, using data from the other mirror, assuming checksum(s) on that mirror are OK.” Auto-repair requires a ZFS native mirror, thus RAID-Z. Without it, ZFS can only go “lost data, sorry!” https://blogs.oracle.com/timc/demonstrating-zfs-self-healing
And here’s a guy who addresses people like you in the section “Problems with ZFS Cult-Like Following” https://www.datamation.com/data-center/the-zfs-story-clearing-up-the-confusion-2.html
“This cult-like following and general misunderstanding of ZFS leads often to misapplications of ZFS or a chain of decision making based off of bad assumptions that can lead one very much astray.”
But YOUR response to these FACTS is to behave like a bratty child, declare yourself superior, and stomp off to be butthurt in a corner of the playground. Way to go! Sounds like you’re a bit “strong in faith, weak on knowledge” as a faux wise man once said to me.
Skepsist says
You really have no idea what you’re talking about. Either you have no real experience with ZFS, or you’re simply unable to see the difference between gold and stone.
Hugh Briss says
Uh, yeah, I know what I’m talking about. Don’t blame your Dunning-Kruger on me.
Skepsist says
If you know what you’re talking about, you’re not convincing anyone by writing highly emotional garbage posts. Just some advice.
Hugh Briss says
You’re projecting. I stated clear facts that anyone can spend 30 seconds searching for and verifying. Your response to them was “nu-uh, you’re stupid!” Who’s the emotional garbage poster again? Introspection is an important part of adulting; perhaps you should try it sometime.
Skepsist says
I never called you stupid. I’m just saying you have no idea what you’re talking about. I’m sorry, but your “facts” are garbage.
When people come off like you (strong in faith, weak on knowledge), people that could have helped you understand the subject better will not bother wasting their time. I’ve wasted enough time on you as it is.
Jason says
I really disagree with this blog post. It’s complaining about stuff that really isn’t true or that most of us moved past long ago, it’s a bit petty, and it ignores all the great strides being made. Dedup? Stop already. ZFS can make use of NVMe, and people have been using and talking about it for years now. Sure, tiered storage isn’t really a thing with ZFS, although it is possible. But it wasn’t designed for that and really isn’t necessary if you set things up the way you should. Device removal and RAID-Z expansion are finally coming and have already been in the works by Matt Ahrens. Native ZFS encryption is in ZoL now as well. Compressed ARC. Send/recv resume. Faster snapshot deletion. And the list goes on.
Honestly, it sounds like you just don’t see the benefits of what the OpenZFS group HAS been doing with ZFS. Because it’s even better for real use today than back then, and ZoL has been a big boost.
No, it isn’t a distributed filesystem. ZFS was never and will never be that. If that’s what you want, you are better off looking into Gluster or Ceph. Or perhaps look at the great stuff being done with Lustre+ZFS, which are now heavily integrated since Lustre realized how awesome ZFS is and didn’t try to redo that. Although I believe Lustre is more a parallel than a distributed filesystem.
Jason says
If the SSD is so cheap for storage you might as well just replace all that spinning disk. If it’s not that cheap, perhaps you need to rethink your pool layouts.
Jason says
It’s very clear from your post you don’t have a clue about how storage systems work at the low level. Your ignorance is astounding.
Hugh Briss says
I write filesystems. You are objectively wrong. Making zero points is not a counter-argument, either. Sorry that you’re butthurt.
underoverlay says
A year later, I’m learning to stop worrying & love ZFS.
XFS is still my preferred choice to build simple/small systems (when you have to use Linux…) with a single drive that do not require high availability, volume portability, etc.
However, as we migrate systems from LXC to LXD, LXD uses ZFS datasets as its preferred native container storage backend, so if you want to learn LXD operations you’d better learn ZFS.
It was easier than I thought it would be.
& it is quite powerful. It certainly has the limitations addressed in this article, but if you’re aware of them & design your systems with them in mind from the beginning, then you’re in pretty good shape.
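As a concrete illustration of the LXD-on-ZFS workflow (a hedged sketch; the pool and dataset names are assumptions):

```sh
# Point LXD at a dataset on an existing pool; each container becomes a child dataset.
lxc storage create default zfs source=tank/lxd
lxc profile device add default root disk path=/ pool=default

# Launch a container; its root filesystem is a ZFS dataset, so snapshots and copies are cheap.
lxc launch ubuntu:18.04 c1
zfs list -r tank/lxd
```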
I think the single biggest difference vs. most other filesystems is that e.g. ext4 is:
– a filesystem
& you still need
– a volume manager
– a mirroring/RAID framework
whereas ZFS is all 3 of these things in a single tightly integrated framework.
Just mirror your zpool across identical partitions on two devices. You don’t need RAID. Really. http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/
If you have spinning rust or slow SSDs, grab an Optane drive (or two) & put your SLOG on it (and swap partition :-). Seriously. Even a <$50 16GB Optane with 8GB SLOG & 8GB swap will massively enhance the performance of any system without at least current-generation enterprise-grade NVMe SSDs (e.g. a quick-n-dirty SQLIte benchmark showed a ~3x speedup on Optane 800p vs. a single desktop NVMe drive). & Optane's latency is low enough that you can use it as swap space without massive performance penalties, which partially mitigates ZFS's ram-hungryness. https://www.servethehome.com/exploring-best-zfs-zil-slog-ssd-intel-optane-nand/
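Concretely, the mirror-plus-Optane layout described above is only a few commands (a hedged sketch; device names are illustrative):

```sh
# Mirrored pool across identical partitions on two drives.
zpool create tank mirror /dev/sda2 /dev/sdb2

# Small Optane partition as a separate intent log (SLOG), second partition as swap.
zpool add tank log /dev/nvme0n1p1
mkswap /dev/nvme0n1p2 && swapon /dev/nvme0n1p2
```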
That's really all you need for virtually any high-availability/high-performance setup short of mission-critical systems that require better than N+1 redundancy & data integrity.
M.Peter says
LOL “One thing really lacking in Btrfs is support for flash, and especially hybrid storage.” – where did you get this from? Btrfs was designed with flash in mind, so what are you talking about?
Also, even the main developer and maintainer of ext4 says: It’s the future. https://arstechnica.com/information-technology/2009/04/linux-collaboration-summit-the-kernel-panel/
Marc Oggier says
I was wondering if you still stand behind zfs nowadays (mid-2019)?
bladerunner6978 says
For-profit-corps like Apple, IBM/Redhat, Microsoft, Oracle, … never ever really liked opensource, unless they can hide and profit from it.
But no one can hide from the great (Open)ZFS, -thanks to the now defunct Sun Microsystems.
Linus is still trying to repeat ZFS with his lowly Btrfs, Microsoft just hates anything that they can’t patent to make money from, and Apple copied/stole ZFS and now calls it APFS-for-encumbered-Profit.
ZFS is here to stay (and even IBM knows this), and if everyone would just jump onboard we would have a future-proof, multi-functional ZFS that ALL users can benefit/share/grow from, no matter the OS.
But lol, that sounds too much like fair globalization, eh???
bladerunner6978 says
Linux, like Apple, had a chance to implement and grow with ZFS, but instead, -’cause of lowly pride, one chose/copied Btrfs, and the other copied/stole (Open)ZFS for profit, and called it APFS.
lol.
bladerunner6978 says
But Apple is not fair, and today your point is proven moot regarding any ZFS on Apple.
bladerunner6978 says
ZFS Is (STILL) the Best Filesystem.
Kjeld Schouten-Lebbing says
FilesystemS, multiple?
Name 2 that are actually enterprise grade.
I doubt you wrote them; at most you have a few PRs in. If you actually wrote them, you would be one of the world’s best filesystem designers.
Instead of stacking hypotheses about why you might be that “uber” developer, I will apply Occam’s Razor and make the hypothesis that you are just a rando on the internet that’s full of shit.
Kjeld Schouten-Lebbing says
@sfoskett You know there exists software to create hierarchical storage, right? That could just as well use 2 underlying ZFS volumes as hard drives. But okay, maybe you would fancy it in ZFS.
And instead of making general “anti-ZFS” statements which you don’t back up, maybe just explain WHY you think that something isn’t right… Like you do here:
“perhaps just a better caching mechanism than L2ARC.”
You continue to fall into making statements you don’t back up.
And don’t get me started about the shit ton of “not understanding before writing” in your article.
Things like:
1. No one talking about NVMe (wrong: NVMe pools are being done and Optane is quite the topic)
2. Need to replace all pool drives to expand. Plain wrong: a pool can include multiple different-size vdevs, and every vdev can be expanded by replacing all the drives in that vdev.
3. “Add a stripe on another set of drives”: that’s just a mishmash of words that shouldn’t be in that sentence, let alone in that order. You don’t “add a stripe” (I sincerely hope you don’t add a stripe vdev to a pool); you add a vdev to a pool/stripe. “Putting a stripe on a set of drives” means putting RAID 0 on them.
And ending with “as sad as it makes me” is quite annoying. The only argument you made was the lack of hierarchical storage. So ZFS is not the solution for your specific wishes; that’s fine. It simply means there doesn’t exist a tool for your project, or at least ZFS isn’t that tool.
Simply put: You blame a flathead screwdriver to be bad at screwing in/out Philips screws.
ZFS is meant to create arrays of disks, caching, transfer, and checksumming. Hierarchical storage is/was never part of the ZFS design. But one could relatively easily put any form of hierarchy on top of ZFS pools. What’s next, blaming every other filesystem for not including hierarchical storage?
It’s a filesystem, not something purposefully designed to fix your personal hierarchical storage problem.
Hugh Briss says
Oh look, four paragraphs of vomit. I’m not DuckDuckGo’ing your name, loser. I don’t care about how big you think your tech nuts are. I never said I was a “file system specialist.” You’re making shit up out of thin air. Tell yourself whatever makes you feel better. I really don’t care what you think, and anyone reading your comment can easily glean how much of a massive dick you are without my help. You are wasting my time with your hurt ego. Don’t bother responding; I’ll only leave snarky replies at this point.
Hugh Briss says
I’m not unveiling my anonymity for you, kiddo. Suffice it to say, yes, I’ve implemented filesystems from scratch. Enterprise-grade? Literally no single person has implemented such a thing solo, at least not by any modern notion of what “enterprise-grade” means. If you want to apply Occam’s Razor, apply it somewhere that doesn’t make you sound like a fool. The response was to someone who has no clue what they’re talking about. As you have demonstrated zero knowledge about anything technical whatsoever in your post, we can apply Occam’s Razor to you and determine that–while I have demonstrated a measurable quantity of knowledge in this thread–you are, in fact, the one that IS “full of shit.”
Kjeld Schouten-Lebbing says
As you are not able to provide any proof or arguments that support your statement about being a filesystem specialist (not even a simple mention of the specific filesystems you are said to be involved with), and this being the internet, we can safely assume you are lying.
“As you have demonstrated zero knowledge about anything technical whatsoever in your post”
– Indeed I have not, unless you tactically ignored my rant about you not understanding even the basic difference between a hard-drive-fixable CRC mismatch and a URE, including the statistical likelihood of UREs on big arrays of drives. Yes, in that case I have not demonstrated any knowledge about anything technical whatsoever.
Considering I actually have demonstrated at least a measurable quantity of knowledge here and elsewhere (which could be easily traced back to me with a slight whip of the keyboard), the most simple explanation would be us both having at least a decent understanding of the matter. Although seemingly that doesn’t include hard-drive failure statistics, and the difference between UREs and recoverable errors.
As you need to hide behind anonymity and seem to have the balls to keep referring to others with derogatory names, I think you get quite the thrill from your trolling. Sadly enough, most of the world views such an attitude as a show of weakness, not strength.
bernstein says
I suggest you try out bcache; it works below ZFS. No need for ZFS to reinvent the wheel.
bernstein says
Not necessarily; if you run a RAID(-Z) you already have checksumming… and checksumming without parity information doesn’t do much good.
lsatenstein says
For the past year and a half, I have been using ZFS with Ubuntu. I am more than satisfied with its stability, its reliability, and its performance.
One concern I have, which is probably due to my lack of knowledge, is handling USB-plugged ZFS-formatted hard drives. While Ubuntu seems to support ZFS as one drive or as a RAID group of drives, other drives do not appear to be supported as ZFS plug-ins. I would like to know how, or if, I could mount an external ZFS-formatted drive via USB or other attachment.
Currently I have ZFS on a 100-gig SSD. I was considering doing a test where I install a second copy of Ubuntu with ZFS on a totally separate drive. Will I be able, given both drives are on the same system, to mount the other system’s drive for RW access?
What I do like with Ubuntu and ZFS is that /etc/fstab allows me to add Btrfs partitions to my setup.
I keep a /Development partition, a /Backup partition, a /Music partition and a /share partition on separate disks. Could ZFS eventually support large partitions in lieu of full disks?
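For reference, an externally attached ZFS drive is handled as its own pool and brought in with zpool import rather than /etc/fstab. A hedged sketch (pool names and mount points are illustrative; importing another install’s pool read-only first is the cautious approach):

```sh
# List pools visible on attached devices, including USB drives.
sudo zpool import

# Import the external pool under an alternate root so it doesn't fight existing mounts.
sudo zpool import -R /mnt/external usbpool

# Import the other install's pool read-only for a safe look around.
sudo zpool import -o readonly=on -R /mnt/otheros otherpool

# Detach cleanly before unplugging.
sudo zpool export usbpool
```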