I don’t usually excerpt large amounts of text from other blogs. But this is just too cool. UNIX nerds and Mac OS X weenies alike will either shake their heads and jump out a window or laugh out loud at one of the under-reported changes in Snow Leopard.
See, Snow Leopard’s version of HFS+ allows per-file compression using three very creative filesystem hacks. I’ll let John Siracusa from Ars Technica take the story from here, and I urge you to read John’s complete (and very, very long) Snow Leopard review!
In Snow Leopard, other kinds of files climb on board the compression bandwagon. To give just one example, ninety-seven percent of the executable files in Snow Leopard are compressed. How compressed? Let’s look:
% cd Applications/Mail.app/Contents/MacOS
% ls -l Mail
-rwxr-xr-x@ 1 root wheel 0 Jun 18 19:35 Mail
Boy, that’s, uh, pretty small, huh? Is this really an executable or what? Let’s check our assumptions.
% file Applications/Mail.app/Contents/MacOS/Mail
Applications/Mail.app/Contents/MacOS/Mail: empty
Yikes! What’s going on here? Well, what I didn’t tell you is that the commands shown above were run from a Leopard system looking at a Snow Leopard disk. In fact, all compressed Snow Leopard files appear to contain zero bytes when viewed from a pre-Snow Leopard version of Mac OS X. (They look and act perfectly normal when booted into Snow Leopard, of course.)
So, where’s the data? The little “@” at the end of the permissions string in the ls output above (a feature introduced in Leopard) provides a clue. Though the Mail executable has a zero file size, it does have some extended attributes:
% xattr -l Applications/Mail.app/Contents/MacOS/Mail
com.apple.ResourceFork:
0000 00 00 01 00 00 2C F5 F2 00 2C F4 F2 00 00 00 32 .....,...,.....2
0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
(184,159 lines snipped)
2CF610 63 6D 70 66 00 00 00 0A 00 01 FF FF 00 00 00 00 cmpf............
2CF620 00 00 00 00 ....
com.apple.decmpfs:
0000 66 70 6D 63 04 00 00 00 A0 82 72 00 00 00 00 00 fpmc......r.....
Ah, there’s all the data. But wait, it’s in the resource fork? Weren’t those deprecated about eight years ago? Indeed they were. What you’re witnessing here is yet another addition to Apple’s favorite file system hobbyhorse, HFS+.
At the dawn of Mac OS X, Apple added journaling, symbolic links, and hard links. In Tiger, extended attributes and access control lists were incorporated. In Leopard, HFS+ gained support for hard links to directories. In Snow Leopard, HFS+ learns another new trick: per-file compression.
The presence of the com.apple.decmpfs attribute is the first hint that this file is compressed. This attribute is actually hidden from the xattr command when booted into Snow Leopard. But from a Leopard system, which has no knowledge of its special significance, it shows up as plain as day.
Even more information is revealed with the help of Mac OS X Internals guru Amit Singh’s hfsdebug program, which has quietly been updated for Snow Leopard.
% hfsdebug /Applications/Mail.app/Contents/MacOS/Mail
compression magic = cmpf
compression type = 4 (resource fork has compressed data)
uncompressed size = 7500336 bytes
And sure enough, as we saw, the resource fork does indeed contain the compressed data. Still, why the resource fork? It’s all part of Apple’s usual, clever backward-compatibility gymnastics. A recent example is the way that hard links to directories show up–and function–as aliases when viewed from a pre-Leopard version of Mac OS X.
In the case of HFS+ compression, Apple was (understandably) unable to make pre-Snow Leopard systems read and interpret the compressed data, which is stored in ways that did not exist at the time those earlier operating systems were written. But rather than letting applications (and users) running on pre-10.6 systems choke on–or worse, corrupt through modification–the unexpectedly compressed file contents, Apple has chosen to hide the compressed data instead.
And where can the complete contents of a potentially large file be hidden in such a way that pre-Snow Leopard systems can still copy that file without the loss of data? Why, in the resource fork, of course. The Finder has always correctly preserved Mac-specific metadata and both the resource and data forks when moving or duplicating files. In Leopard, even the lowly cp and rsync commands will do the same. So while it may be a little bit spooky to see all those “empty” 0 KB files when looking at a Snow Leopard disk from a pre-Snow Leopard OS, the chance of data loss is small, even if you move or copy one of the files.
The resource fork isn’t the only place where Apple has decided to smuggle compressed data. For smaller files, hfsdebug shows the following:
% hfsdebug /etc/asl.conf
compression magic = cmpf
compression type = 3 (xattr has compressed data)
uncompressed size = 860 bytes
Here, the data is small enough to be stored entirely within an extended attribute, albeit in compressed form. And then, the final frontier:
% hfsdebug "/Volumes/Snow Time/Applications/Mail.app/Contents/PkgInfo"
compression magic = cmpf
compression type = 3 (xattr has inline data)
uncompressed size = 8 bytes
That’s right, an entire file’s contents stored uncompressed in an extended attribute. In the case of a standard PkgInfo file like this one, those contents are the four-byte classic Mac OS type and creator codes.
% xattr -l Applications/Mail.app/Contents/PkgInfo
com.apple.decmpfs:
0000 66 70 6D 63 03 00 00 00 08 00 00 00 00 00 00 00 fpmc............
0010 FF 41 50 50 4C 65 6D 61 6C .APPLemal
There’s still the same “fpmc…” preamble seen in all the earlier examples of the com.apple.decmpfs attribute, but at the end of the value, the expected data appears as plain as day: type code “APPL” (application) and creator code “emal” (for the Mail application–cute, as per classic Mac OS tradition).
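For readers who want to poke at these bytes themselves, here is a minimal Python sketch that decodes the PkgInfo attribute value shown above. The field layout (a four-byte magic, a four-byte little-endian type, an eight-byte little-endian uncompressed size, then the payload) is inferred from the hex dump and the hfsdebug output in this excerpt, not taken from Apple documentation. The function name `parse_decmpfs` is made up for illustration, and reading the 0xFF byte as a "stored uncompressed" marker is an assumption based on this one example.

```python
import struct

# Attribute value copied from the xattr dump of PkgInfo above.
PKGINFO_XATTR = bytes.fromhex(
    "66 70 6D 63 03 00 00 00 08 00 00 00 00 00 00 00"
    "FF 41 50 50 4C 65 6D 61 6C"
)

HEADER = "<4sIQ"  # assumed layout: magic, type, uncompressed size (little-endian)


def parse_decmpfs(value):
    """Split an attribute value into header fields and payload (assumed layout)."""
    magic, ctype, size = struct.unpack_from(HEADER, value, 0)
    return {
        "magic": magic[::-1].decode("ascii"),  # stored byte-reversed: fpmc -> cmpf
        "type": ctype,
        "uncompressed_size": size,
        "payload": value[struct.calcsize(HEADER):],
    }


info = parse_decmpfs(PKGINFO_XATTR)
print(info["magic"], info["type"], info["uncompressed_size"])

# Assumption: a leading 0xFF byte marks data stored without compression.
data = b""
if info["payload"][:1] == b"\xff":
    data = info["payload"][1:]
    print(data[:4].decode("ascii"), data[4:].decode("ascii"))
```

The decoded values line up with what hfsdebug reported for this file: magic "cmpf", type 3, and an uncompressed size of 8 bytes, followed by the type and creator codes.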
You may be wondering, if this is all about data compression, how does storing eight uncompressed bytes plus a 17-byte preamble in an extended attribute save any disk space? The answer to that lies in how HFS+ allocates disk space. When storing information in a data or resource fork, HFS+ allocates space in multiples of the file system’s allocation block size (4 KB, by default). So those eight bytes will take up a minimum of 4,096 bytes if stored in the traditional way. When allocating disk space for extended attributes, however, the allocation block size is not a factor; the data is packed in much more tightly. In the end, the actual space saved by storing those 25 bytes of data in an extended attribute is over 4,000 bytes.
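The arithmetic behind that claim can be checked with a few lines of Python. This is only a sketch: it uses the 4 KB default block size and the byte counts quoted above, and it ignores whatever per-record overhead the Attributes File itself adds. The helper name `fork_bytes_on_disk` is made up for illustration.

```python
import math

ALLOCATION_BLOCK = 4096  # default HFS+ allocation block size cited above


def fork_bytes_on_disk(length, block=ALLOCATION_BLOCK):
    """Space consumed when data lives in a data or resource fork."""
    return math.ceil(length / block) * block


payload = 8    # type and creator codes
preamble = 17  # preamble bytes preceding the payload in the attribute

in_fork = fork_bytes_on_disk(payload)
in_xattr = payload + preamble  # simplification: no B-tree record overhead
saved = in_fork - in_xattr
print(in_fork, in_xattr, saved)
```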
But compression isn’t just about saving disk space. It’s also a classic example of trading CPU cycles for decreased I/O latency and bandwidth. Over the past few decades, CPU performance has gotten better (and computing resources more plentiful–more on that later) at a much faster rate than disk performance has increased. Modern hard disk seek times and rotational delays are still measured in milliseconds. In one millisecond, a 2 GHz CPU goes through two million cycles. And then, of course, there’s still the actual data transfer time to consider.
Granted, several levels of caching throughout the OS and hardware work mightily to hide these delays. But those bits have to come off the disk at some point to fill those caches. Compression means that fewer bits have to be transferred. Given the almost comical glut of CPU resources on a modern multi-core Mac under normal use, the total time needed to transfer a compressed payload from the disk and use the CPU to decompress its contents into memory will still usually be far less than the time it’d take to transfer the data in uncompressed form.
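To make that trade-off concrete, here is a rough back-of-the-envelope model in Python. Every number fed into it is a hypothetical placeholder (a 50 MB/s disk, a 2:1 compression ratio, 200 MB/s decompression), not a measurement from Snow Leopard, and the model pessimistically assumes transfer and decompression never overlap. The only figure taken from the text is Mail's uncompressed size as reported by hfsdebug.

```python
def cycles_per_millisecond(clock_hz):
    """CPU cycles that elapse during one millisecond."""
    return clock_hz // 1000


def read_time_ms(size_bytes, disk_mb_per_s, ratio=1.0, decompress_mb_per_s=None):
    """Transfer time plus optional decompression time, in milliseconds."""
    on_disk_bytes = size_bytes / ratio
    transfer = on_disk_bytes / (disk_mb_per_s * 1e6) * 1e3
    decompress = 0.0
    if decompress_mb_per_s is not None:
        decompress = size_bytes / (decompress_mb_per_s * 1e6) * 1e3
    return transfer + decompress


MAIL_SIZE = 7_500_336  # uncompressed size reported by hfsdebug above

# Hypothetical hardware figures, chosen only for illustration.
plain = read_time_ms(MAIL_SIZE, disk_mb_per_s=50)
compressed = read_time_ms(MAIL_SIZE, disk_mb_per_s=50, ratio=2.0,
                          decompress_mb_per_s=200)

print(cycles_per_millisecond(2_000_000_000))
print(round(plain, 1), round(compressed, 1))
```

Under these made-up figures the compressed read wins. With a slow enough decompressor or a fast enough disk the result flips, which is why the text says "usually" rather than "always."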
That explains the potential performance benefits of transferring less data, but the use of extended attributes to store file contents can actually make things faster, as well. It all has to do with data locality.
If there’s one thing that slows down a hard disk more than transferring a large amount of data, it’s moving its heads from one part of the disk to another. Every move means time for the head to start moving, then stop, then ensure that it’s correctly positioned over the desired location, then wait for the spinning disk to put the desired bits beneath it. These are all real, physical, moving parts, and it’s amazing that they do their dance as quickly and efficiently as they do, but physics has its limits. These motions are the real performance killers for rotational storage like hard disks.
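A quick sketch shows why positioning dominates for small files. The drive parameters here (7,200 RPM, a 9 ms average seek, 50 MB/s sustained transfer) are assumed, typical-looking values rather than figures from the review. The 860-byte size is the /etc/asl.conf example from earlier.

```python
def avg_rotational_latency_ms(rpm):
    """On average the platter must turn half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def transfer_ms(size_bytes, mb_per_s):
    """Time to stream the bytes once the head is in position."""
    return size_bytes / (mb_per_s * 1e6) * 1e3


# Assumed drive characteristics, for illustration only.
SEEK_MS = 9.0
RPM = 7200
MB_PER_S = 50

positioning = SEEK_MS + avg_rotational_latency_ms(RPM)
streaming = transfer_ms(860, MB_PER_S)

print(round(positioning, 2), round(streaming, 4))
```

With these assumptions, getting the head into place costs hundreds of times more than actually reading the 860 bytes, so avoiding even one extra head movement matters far more than shaving bytes off the transfer.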
The HFS+ volume format stores all its information about files–metadata–in two primary locations on disk: the Catalog File, which stores file dates, permissions, ownership, and a host of other things, and the Attributes File, which stores “named forks.”
Extended attributes in HFS+ are implemented as named forks in the Attributes File. But unlike resource forks, which can be very large (up to the maximum file size supported by the file system), extended attributes in HFS+ are stored “inline” in the Attributes File. In practice, this means a limit of about 128 bytes per attribute. But it also means that the disk head doesn’t need to take a trip to another part of the disk to get the actual data.
As you can imagine, the disk blocks that make up the Catalog and Attributes files are frequently accessed, and therefore more likely than most to be in a cache somewhere. All of this conspires to make the complete storage of a file, including both its metadata and its data, within the B-tree-structured Catalog and Attributes files an overall performance win. Even an eight-byte payload that balloons to 25 bytes is not a concern, as long as it’s still less than the allocation block size for normal data storage, and as long as it all fits within a B-tree node in the Attributes File that the OS has to read in its entirety anyway.
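Putting the three hiding places together, the placement decision might look something like the Python sketch below. This is a guess at the shape of the logic, not Apple's implementation. The function `choose_storage` is hypothetical, zlib stands in for whatever compressor HFS+ actually uses (the excerpt does not name one), and the inline limit is passed in as a parameter, using the "about 128 bytes" figure quoted above in the examples.

```python
import random
import zlib


def choose_storage(data, inline_limit):
    """Pick a location for file contents under a hypothetical policy."""
    compressed = zlib.compress(data)  # assumption: zlib as a stand-in compressor
    if len(compressed) >= len(data):
        # Compression does not help; keep raw bytes behind a marker byte,
        # as seen in the PkgInfo dump.
        encoded, form = b"\xff" + data, "uncompressed"
    else:
        encoded, form = compressed, "compressed"
    where = "xattr" if len(encoded) <= inline_limit else "resource fork"
    return where, form


tiny = b"APPLemal"                            # PkgInfo contents
small = b"# example config line\n" * 40       # repetitive text, 880 bytes
rng = random.Random(0)
large = bytes(rng.choice(b"abcd") for _ in range(20_000))

for sample in (tiny, small, large):
    print(len(sample), choose_storage(sample, inline_limit=128))
```

The tiny sample stays uncompressed in the attribute, the repetitive text shrinks enough to fit inline, and the large sample spills over to the resource fork, mirroring the three cases hfsdebug revealed.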
There are other significant contributions to Snow Leopard’s reduced disk footprint (e.g., the removal of unnecessary localizations and “designable.nib” files) but HFS+ compression is by far the most technically interesting.