Credit to: Kartik Ayyar, Formerly a filesystem engineer at NetApp
At the highest level, I’d split the benefits of btrfs over other filesystems into 3 categories.
1. Benefits arising out of using a copy on write tree with no in place updates as the storage primitive.
2. Benefits arising out of a separation of logical and physical units of data management.
3. Everything else
The first two buckets are the ones to pay attention to. If you take a look at the literature describing btrfs, WAFL and ZFS, the first two are the real game changers for performance, reliability and data management - most of the other features can be found in or ported to other filesystems.
Tree structure based benefits
1. Creating snapshots of your data is extremely efficient
This is the single most important benefit - snapshots are incredibly efficient to create, and have very little performance impact. This is because creating a snapshot is as simple as creating a new root pointer to your data set and incrementing reference counts on metadata. You can create snapshots in the midst of other activity on your system without adding any significant load to it.
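The idea can be sketched with a toy copy-on-write tree in Python. Names like `Node` and `take_snapshot` are illustrative, not btrfs internals:

```python
# Toy model of copy-on-write snapshotting: a snapshot is a new root
# pointer plus a refcount bump -- O(1) work, no data blocks copied.
# (Illustrative only; real btrfs metadata is far more involved.)

class Node:
    def __init__(self, data, children=()):
        self.data = data
        self.children = list(children)
        self.refcount = 1        # how many roots/parents reference this node

def take_snapshot(root):
    root.refcount += 1           # the whole tree below is now shared
    return root                  # the "snapshot" is just the old root pointer

live = Node("root", [Node("a"), Node("b")])
snap = take_snapshot(live)
assert snap is live and live.refcount == 2   # nothing was copied
```

Note that the cost is constant regardless of how much data the tree holds - which is why snapshots barely load a busy system.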
2. Calculating incremental deltas for backups is extremely efficient.
Calculating deltas between two different versions of trees, be they snapshots or writeable clones, is highly efficient, since it relies on a pure metadata comparison: subtrees shared between the two versions can be skipped without reading any data.
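One way to picture this: if unchanged subtrees are shared between two versions, a delta walk can skip them by identity alone. A minimal sketch using tuples as immutable tree nodes (purely illustrative, not the actual btrfs send algorithm):

```python
# Delta between two tree versions: any subtree that is the *same object*
# in both versions is unchanged and can be skipped without reading data,
# so cost scales with the size of the change, not of the filesystem.

def delta(old, new, path="/"):
    if old is new:               # shared subtree -> unchanged, skip entirely
        return []
    changed = [path]
    for i, (o, n) in enumerate(zip(old[1:], new[1:])):
        changed += delta(o, n, path + str(i) + "/")
    return changed

leaf_a, leaf_b = ("a",), ("b",)
v1 = ("root", leaf_a, leaf_b)
v2 = ("root", leaf_a, ("b2",))   # copy-on-write: only one leaf replaced
assert delta(v1, v2) == ["/", "/1/"]
```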
3. Creating writeable clones is extremely efficient
Creating a clone is very similar to creating a snapshot and is thus similarly efficient, as it simply involves creating a new root pointer to a tree and adding bookkeeping for reference counting.
This is a killer feature in environments where you have many copies of almost the same data, such as large test databases or virtual machine images.
4. Rolling back to a given snapshot is super efficient
Rolling back to a given older version of the filesystem is highly efficient as it primarily involves swapping a pointer to an older version of a tree.
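In toy form (a sketch, not the actual btrfs implementation), rollback is nothing more than re-pointing the current root at a retained snapshot:

```python
# Rollback as a pointer swap: the snapshot's root tree is already on
# disk, so restoring it moves no data at all.

snapshots = {"before-upgrade": ("root", ("good-data",))}
current_root = ("root", ("good-data",), ("bad-upgrade",))

current_root = snapshots["before-upgrade"]   # the entire rollback
assert current_root == ("root", ("good-data",))
```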
5. The filesystem has transactional semantics
Non-transactional semantics for a filesystem are bad. They can expose you to corruption whenever an operation updates different blocks in different places non-atomically.
For example, to create a new directory entry, you need to allocate an inode and also make a directory entry point to it. If this is not done atomically, depending on your implementation, you could end up with a leaked inode or a directory entry that points to an unallocated inode.
With btrfs and similar filesystems, you never write in place, and an update is only complete when you update the root pointer to the new tree. This means that all your filesystem operations move you from one consistent state to another.
If you crash in the middle of an update, since you never wrote in place and the final tree root pointer update never hit the disks, there is nothing required to get your filesystem back to a consistent state.
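A sketch of the commit discipline (illustrative names, not real btrfs structures): new tree nodes are written off to the side, and the only in-place write is the final root-pointer swap.

```python
# Path-copy update with a single atomic commit point. Until sb.root is
# swapped, the old tree *is* the filesystem; a crash before the swap
# simply leaves the previous consistent version in place.

class Superblock:
    def __init__(self, root):
        self.root = root                     # the one pointer updated in place

def replace_child(node, index, new_child):
    """Copy-on-write: rebuild the path from root down, never touch old nodes."""
    children = list(node[1:])
    children[index] = new_child
    return (node[0],) + tuple(children)      # brand-new root

sb = Superblock(("root", ("a",), ("b",)))
old_root = sb.root
new_root = replace_child(sb.root, 1, ("b2",))  # written "off to the side"
sb.root = new_root                             # atomic commit point
assert old_root == ("root", ("a",), ("b",))    # old version untouched
assert sb.root == ("root", ("a",), ("b2",))
```

If the process dies anywhere before the `sb.root = new_root` line, the old root still describes a fully consistent tree - no replay or fsck needed.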
This class of benefits is the most important - NetApp shipped a product based on this idea in 1994, and it was a huge competitive differentiator in its success. It was also one of the primary similarities cited when NetApp sued Sun over ZFS.
Benefits arising from a separation of volumes and subvolumes
One key fundamental of building a high-performing filesystem is making full use of the IO bandwidth of a large collection of disks. For performance reasons, you want as many spindles as possible to parallelize IO.
The initial solution was to simply create one giant filesystem with a one-to-one mapping to a large pool of disks, to increase disk bandwidth and also amortize the cost of your RAID parity disk(s).
However, doing so creates a data management nightmare - you can’t, for example, snapshot or back up just your important data at a high frequency without also doing the same for your low-priority data.
The solution that addressed both performance and data management needs was to make the tree structures live inside what is essentially a “traditional” filesystem that maps directly to a physical RAID volume (known as an aggregate, pool and volume in WAFL, ZFS and btrfs respectively), while user data lives inside logical volumes (known as flexible volumes, volumes and subvolumes in WAFL, ZFS and btrfs respectively).
The benefits of this are as below:
1. High performance IO for even small filesystems by sharing spindles with larger filesystems
The idea here is that your volume maps directly to your RAID layout, while a subvolume can be much smaller and still benefit from the full raw IO bandwidth of all the spindles.
2. Instant, easy on the fly partition resizing
In the subvolume scheme, partitions are just quotas.
Specifically, different subvolumes actually share free space, so if your volumes fill up in a way different from what you planned, you can just adjust the partitions on the fly.
This makes it easier to pool your free space, improve storage utilization, and adjust your space allocation on the fly when your partitions fill up in a manner different from what you had planned.
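In sketch form (hypothetical bookkeeping, not btrfs’s actual quota groups), resizing a subvolume is just changing a number while all subvolumes draw on one shared pool:

```python
# Subvolume "partitions" as quotas over one shared free-space pool:
# resizing is pure bookkeeping, so no data moves and there is no downtime.

pool_free = 1000                           # GiB free, shared by all subvolumes
quotas = {"important": 300, "scratch": 200}

def resize(subvolume, new_limit):
    quotas[subvolume] = new_limit          # the entire "repartition"

resize("scratch", 600)                     # grow on the fly
assert quotas["scratch"] == 600
```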
3. Decoupling the unit of data management from the unit of physical IO bandwidth / redundancy
By having subvolumes distinct from physical volumes, you can do things such as set up different snapshot and backup schedules for part of your data without applying them to all of it.
btrfs also uses extents to reduce fragmentation (unlike ZFS and WAFL, btrfs appears to reference count extents rather than blocks), along with block checksums, built-in compression and various other features, but these are typically the class of features that can be adapted to other filesystems without changing their fundamental architecture.