ZFS (now developed as OpenZFS) is a combined filesystem, software RAID, and volume management tool - it replaces hardware RAID controllers, volume managers, and snapshot tooling. It originated at Sun Microsystems (now Oracle), is natively supported and developed by the FreeBSD team, and was later ported to Linux under the "ZFS on Linux" project.
ZFS provides a number of features, including:
- Software RAID - no more RAID cards
- Datasets - allowing one pool to be split into smaller, flexible filesystems, each with its own replication, properties, and quotas
- Snapshots - lightweight snapshots can be created, storing only the data that has changed
- ZFS Send / Receive - allows a pool or dataset snapshot to be sent at the block level. This includes incremental sending, meaning that only the data changed between snapshots must be sent.
- Scrub / Metadata Integrity - ZFS checksums all data on disk to ensure physical corruption cannot silently damage the data. Checksums are verified on every FS block read, as well as during a manual scrub, which is typically run about once per week, or at least several times per month.
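As an illustration of the last point, a manual scrub is started and monitored with two commands (a minimal sketch, assuming a pool named `tank`):

```shell
# Walk every allocated block in the pool and verify its checksum,
# repairing from redundancy where possible. Runs in the background.
sudo zpool scrub tank

# Check scrub progress and any errors found so far.
sudo zpool status tank
```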
Important Terms and Concepts
Based on a guide from the FreeBSD handbook
ZFS has a number of important terms and concepts to be aware of.
A storage pool is the most basic building block of ZFS. A pool is made up of one or more vdevs, the underlying devices that store the data. A pool is then used to create one or more file systems (datasets) or block devices (volumes). These datasets and volumes share the pool of remaining free space. Each pool is uniquely identified by a name and a GUID. The features available are determined by the ZFS version number on the pool.
TODO: document pool upgrade procedures
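Pending that section, a rough sketch of the usual procedure (the pool name `tank` is assumed; note that upgrading is one-way, and older ZFS implementations may no longer be able to import an upgraded pool):

```shell
# Show which pools are not yet using all features supported by the
# installed ZFS tools.
sudo zpool upgrade

# Enable all supported features on the pool (irreversible).
sudo zpool upgrade tank
```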
A pool consists of one or more striped vdevs. To use standard RAID terminology, a pool is a RAID-0 of vdevs. As such, it is important that each vdev be fully redundant, because the loss of a vdev will result in the loss of the pool. Redundancy is provided by making the vdevs themselves redundant.
In a production system, a vdev must be made up of a redundant set of disks. If a single disk is imported as a vdev, the loss of that single disk will cause the loss of the pool.
However, for testing and development, it may be acceptable to use single-disk vdevs, with the understanding that, as with a RAID-0 volume, the loss of a single disk will cause the pool to be lost. Additionally, vdevs cannot be removed from a pool. If a pool needs to be downsized, a new pool must be built, then `zfs send` and `zfs receive` used to transfer the data to the new pool.
Below is a list of the types of vdevs:
A mirror is a set of disks containing the same data, and can contain two or more disks.

When creating a mirror, specify the `mirror` keyword followed by the list of member devices for the mirror. While the devices in the mirror can be of mixed sizes, a mirror vdev will only hold as much data as its smallest member. A mirror vdev can withstand the failure of all but one of its members without losing any data.

A mirror will write at the speed of its slowest disk, because every device must receive each written block, and will read at roughly the sum of its devices' speeds, because any block can be read from any device in the mirror.
A regular single-disk vdev can be upgraded to a mirror vdev at any time with `zpool attach`.
RAID-Z is a software implementation of RAID 5, and it performs similarly to standard RAID 5 implementations. The capacity of each disk can be mixed, but only the smallest disk's capacity is used on every disk. The capacity of a RAID-Z vdev is the sum of the disks' capacity, minus one disk for parity.
It is recommended to use no more than nine disks in a single vdev. If the configuration has more disks, divide them into separate vdevs; the pool's data will be striped across them.
RAID-Z is technically production-safe, but not recommended, because a single disk failure leaves the vdev with no redundancy. Use RAID-Z2 or RAID-Z3 if possible.
RAID-Z2 is a software implementation of RAID6. It is similar to RAID-Z, but offers two redundant disks.
RAID-Z3 is similar to RAID-Z2, but offers three redundant disks. It is slower than RAID-Z2 and RAID-Z, but is also highly resilient - you can lose up to 3 disks from the vdev safely.
Disk - not suitable for production
A disk vdev is a single disk - the most basic type of vdev is a standard block device. This can be an entire disk (such as `/dev/sda`) or a partition (such as `/dev/sda1`). This is not suitable for production because the loss of that single disk will cause the loss of the pool. However, a disk vdev can be upgraded to a mirror by attaching a second disk of equal or larger size.
TODO: document the command to upgrade a disk to a mirror.
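Pending that, a sketch using `zpool attach` (the pool name `tank` and device names `sda`/`sdb` are placeholders):

```shell
# Mirror the existing single-disk vdev sda with a new disk sdb.
# ZFS resilvers (copies) the existing data onto sdb in the background.
sudo zpool attach tank sda sdb

# Watch resilver progress.
sudo zpool status tank
```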
File - not suitable for production
In addition to disks, ZFS pools can be backed by regular files; this is especially useful for testing and experimentation. Use the full path to the file as the device path in `zpool create`. All vdevs must be at least 128 MB in size.
ZFS has a special pseudo-vdev type for keeping track of available hot spares. Note that installed hot spares are not deployed automatically; they must be manually configured to replace the failed device using `zpool replace`.
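A sketch of the spare workflow (the pool and device names are placeholders):

```shell
# Register sdx as a hot spare for the pool.
sudo zpool add tank spare sdx

# After a member disk (here sdd) fails, manually deploy the spare.
sudo zpool replace tank sdd sdx
```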
ZFS Log Devices, also known as ZFS Intent Log (ZIL) devices, move the intent log from the regular pool devices to a dedicated device, typically an SSD. Having a dedicated log device can significantly improve the performance of applications with a high volume of synchronous writes, especially databases. Log devices can be mirrored, but RAID-Z is not supported. If multiple log devices are used, writes will be load-balanced across them.
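For example, a mirrored log device might be added like this (the NVMe device names are placeholders):

```shell
# Add a mirrored SLOG so a single SSD failure cannot lose
# recently acknowledged synchronous writes.
sudo zpool add tank log mirror nvme0n1 nvme1n1
```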
Adding a cache vdev to a pool will add the storage of the cache to the L2ARC. Cache devices cannot be mirrored. Since a cache device only stores additional copies of existing data, there is no risk of data loss. Note that having a large cache device can have adverse effects, because files in the cache must be referenced in memory. If the cache is too large, it can reduce the ability to cache data in memory, which will be faster than any cache device.
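Adding (and, since it holds no unique data, safely removing) a cache device is a one-liner each (the device name is a placeholder):

```shell
# Add an SSD as L2ARC for the pool.
sudo zpool add tank cache sdx

# Remove it again; no data is lost, only cached copies.
sudo zpool remove tank sdx
```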
TODO: Document the following concepts:
- pool - done
- vdev - done
- Degraded pool
- each of the vdev types (mirror, RAIDz, etc)
- Special devices
RAID Card Note
Ideally, RAID cards should be replaced with JBOD HBA devices, to allow ZFS full control over the drives. However, if this is not possible for whatever reason, set up each individual disk as its own single-disk RAID 0 volume.

This is sub-par, because it stops ZFS from doing some lower-level drive access, instead presenting it with a RAID card abstraction, but it can be acceptable if required.
First, install the needed packages from the `universe` repository:

```
sudo apt-add-repository universe -y && sudo apt update && sudo apt install -y zfsutils-linux
```
To confirm the ZFS package is installed properly, run:

```
sudo zpool status
```

This should return `no pools available`, which confirms that ZFS is installed and no pool is set up.
Find Your Disks
Find all your block devices with `lsblk` (LiSt BLocK devices).

Output will look like:

```
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0    7:0    0  86.9M  1 loop /snap/core/4917
sda      8:0    0 136.1G  0 disk
sda1     8:1    0   512M  0 part /boot/efi
sda2     8:2    0 135.6G  0 part /
sdb      8:16   0   2.7T  0 disk
sdc      8:32   0   2.7T  0 disk
sdd      8:48   0   2.7T  0 disk
sde      8:64   0   2.7T  0 disk
sdf      8:80   0 558.4G  0 disk
sdg      8:96   0 558.4G  0 disk
sdh      8:112  0 558.4G  0 disk
sdi      8:128  0 558.4G  0 disk
sdj      8:144  0 558.4G  0 disk
sdk      8:160  0 558.4G  0 disk
```
The first column is the device name under `/dev/`. That is: `sde` is the device found at `/dev/sde` on the system.
In this case, we are setting up 5 vdevs, each a mirror of two drives:
- mirror 1 (3.0t): sdb + sdc
- mirror 2 (3.0t): sdd + sde
- mirror 3 (0.5t): sdf + sdg
- mirror 4 (0.5t): sdh + sdi
- mirror 5 (0.5t): sdj + sdk
These mirrors are added as top-level vdevs in the pool, which we will call `tank`, for a total usable space of about 7.5T.
Pool Setup Basics
With the pool layout and device names found above, we create the pool.
The command syntax is as follows:
- `zpool`: this is the name of the command line tool
- `create`: create a new pool
- `[name of pool]`: a string value naming the pool. `tank` is often used as a default value.
- vdev definition block (repeat once per vdev in your pool):
  - `[type of vdev]`: in our case, we're using `mirror`. Could also be `raidz2` (RAID 6), `raidz3` (triple parity), or some other values.
  - device names in the vdev
Since this may not be clear, here are a few examples of commands and their output:
```
zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0
```

Create a pool called `tank` with two vdevs. The first vdev is a mirror of `/dev/c1d0` and `/dev/c2d0`; the second vdev is a mirror of `/dev/c3d0` and `/dev/c4d0`.
```
zpool create tank raidz2 sdb sdc sdd sde sdf
```

Create a pool called `tank` with one vdev: a raidz2 (RAID 6) of the five disks `sdb` through `sdf`.
In our case, with the pool layout found above, we will create a pool called `tank` using the following command:
```
sudo zpool create -n tank mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi mirror sdj sdk
```

The `-n` flag makes it a dry-run. This will allow you to see the proposed result of your command without touching the disks.
The output will look like:

```
would create 'tank' with the following layout:

	tank
	  mirror
	    sdb
	    sdc
	  mirror
	    sdd
	    sde
	  mirror
	    sdf
	    sdg
	  mirror
	    sdh
	    sdi
	  mirror
	    sdj
	    sdk
```
This is what we expect: 5 pairs of mirrors.
Tweak the syntax with dry-runs as many times as is needed. No changes are made to the disks when run in dry-run mode.
When you are satisfied, remove `-n`, and the command will actually be executed on the disks:

```
sudo zpool create tank mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi mirror sdj sdk
```
If you receive no output, it was likely successful. Confirm by running a few commands to view the status of your pool.
- `zpool status` - detailed pool information, including disk health
- `zfs list` - list of all pools and datasets, showing free/used space
- `df -h` - systemwide disk-space overview - your pool should now be mounted at the filesystem root under the pool name (i.e. a pool named `tank` is mounted by default at `/tank`)
Datasets are ways of dividing up a pool. A dataset can be assigned quotas, replicated, configured, mounted, and snapshotted separately from every other dataset.
For example, you might set one dataset for home directories, and another for VM images. Your VM images might be snapshotted nightly, replicated off the host, and stored with compression disabled, whereas your home directories would be snapshotted hourly with a higher compression level enabled.
Each dataset is treated as its own Linux filesystem, but the storage capacity of the entire pool is shared across datasets (unless you apply quotas or reservations to restrict capacity use).
Datasets can also be nested, like standard Linux filesystems.
Create some datasets using the command `zfs create [pool]/[datasetname]`. Note that datasets can be nested, but are not created recursively. That is: you cannot create `tank/system/homedirs` until you have created `tank/system`.

```
sudo zfs create tank/vms
sudo zfs create tank/vms/installers
sudo zfs create tank/vms/disks
sudo zfs create tank/system
sudo zfs create tank/system/homedirs
```
This creates datasets for VM install images, VM disks, and the home directories.
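As a sketch of the per-dataset configuration described above (the property values here are illustrative, not recommendations):

```shell
# Cap home directories at 500G of the shared pool space.
sudo zfs set quota=500G tank/system/homedirs

# Compress home directories; leave VM disk images uncompressed.
sudo zfs set compression=lz4 tank/system/homedirs
sudo zfs set compression=off tank/vms/disks

# Verify a property.
zfs get compression tank/vms/disks
```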
After creating your datasets, look again using the tools listed above (`zpool status`, `zfs list`, etc.), and you'll see your datasets listed there.
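Tying this back to snapshots and `zfs send`/`receive`, a nightly incremental backup of the VM disks dataset might look like this (the snapshot names, the `backuphost` hostname, and the `backup` pool are placeholders):

```shell
# Night 1: take an atomic snapshot.
sudo zfs snapshot tank/vms/disks@night1

# Night 2: snapshot again, then send only the blocks changed between
# the two snapshots to a pool on another host.
sudo zfs snapshot tank/vms/disks@night2
sudo zfs send -i tank/vms/disks@night1 tank/vms/disks@night2 | \
    ssh backuphost sudo zfs receive backup/vms/disks
```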
TODO: ADD MORE DETAILS HERE ABOUT HOW DATASETS LOOK LIKE NORMAL SUBDIRS BUT AREN’T TODO: ADD MORE DETAILS HERE ABOUT DATASET NAMES VS MOUNT PATHS