Create ZFS Pool
Tags:
- zfs
- zfsonlinux
- filesystems
- storage
- ubuntu-18-04
ZFS Overview
ZFS (now developed as OpenZFS) is a software RAID and volume management tool: it replaces hardware RAID controllers, volume managers, and snapshot tooling. It originated at Sun Microsystems (now Oracle) and is natively supported and developed by the FreeBSD team. It was later ported to Linux under the “ZFS on Linux” project.
ZFS provides a number of features, including:
- Software RAID - no more RAID cards
- Datasets - Allowing one pool to be split into smaller flexible pools with their own replication, properties, and quotas.
- Snapshots - Lightweight snapshots can be created, storing only the data that has been changed
- ZFS Send / Receive - This allows a pool or dataset snapshot to be sent at the block level. This includes incremental sending, meaning that only the data changed between snapshots must be sent.
- Scrub / Metadata Integrity - ZFS checksums all data on disk so that physical corruption can be detected before it causes problems. Checksums are verified on every filesystem block read, as well as during a manual scrub, which is typically run weekly, or at least a few times per month (see the example after this list).
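For instance, a snapshot can be taken and a scrub kicked off with commands like the following; the dataset name tank/data is a placeholder:
sudo zfs snapshot tank/data@before-upgrade   # snapshot the (hypothetical) dataset tank/data
sudo zpool scrub tank                        # re-verify every checksum in the pool
sudo zpool status tank                       # reports scrub progress and any errors found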
Important Terms and Concepts
Based on a guide from the FreeBSD handbook
ZFS has a number of important terms and concepts to be aware of.
Pool
A storage pool is the most basic building block of ZFS. A pool is made up of one or more vdevs, the underlying devices that store the data. A pool is then used to create one or more file systems (datasets) or block devices (volumes). These datasets and volumes share the pool of remaining free space. Each pool is uniquely identified by a name and a GUID. The features available are determined by the ZFS version number on the pool.
Vdev
A pool consists of one or more striped vdevs. In standard RAID terminology, a pool is a RAID-0 of vdevs. As such, it is important that each vdev be fully redundant, because the loss of any vdev results in the loss of the whole pool. Redundancy is provided by making the vdevs themselves redundant.
In a production system, a vdev must be made up of a redundant set of disks. If a single disk is imported as a vdev, the loss of that single disk will cause the loss of the pool.
However, for testing and development, it may be suitable to use single-disk vdevs, with the understanding that, as in a RAID-0 volume, the loss of a single disk will cause the pool to be lost. Additionally, vdevs cannot be removed from a pool. If a pool needs to be downsized, a new pool must be built, then the data sent to it with zfs send and zfs receive.
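A minimal sketch of such a migration, assuming the new pool is named newtank (names are illustrative):
sudo zfs snapshot -r tank@migrate                            # recursive snapshot of every dataset in the old pool
sudo zfs send -R tank@migrate | sudo zfs receive -F newtank  # replicate the snapshots into the new pool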
Below is a list of the types of vdevs:
Mirror
A mirror is a set of two or more disks containing the same data.
When creating a mirror, specify the mirror keyword followed by the list of member devices for the mirror. While the devices in the mirror can be mixed, a mirror vdev will only hold as much data as its smallest member. A mirror vdev can withstand the failure of all but one of its members without losing any data.
A mirror writes at the speed of its slowest disk, as every device must receive each written block, and reads at the combined speed of its devices, because any block can be read from any device in the mirror.
A regular single-disk vdev can be upgraded to a mirror vdev at any time with zpool attach.
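For example, assuming a pool named tank whose only device is sdb, a second disk could be attached like this (device names are illustrative):
sudo zpool attach tank sdb sdc   # sdc becomes a mirror of the existing sdb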
RAID-Z
RAID-Z is a software implementation of RAID 5 - it performs similarly to standard RAID 5 implementations. Disks of mixed capacity can be used, but each disk contributes only as much space as the smallest disk in the vdev. The capacity of a RAID-Z vdev is the sum of its disks' capacities, minus one disk for parity.
It is recommended to add no more than nine disks in a single vdev. If the configuration has more disks, it is recommended to divide them into separate vdevs and the pool data will be striped across them.
RAID-Z is technically production safe, but not recommended, because a single disk failure leaves the vdev with no redundancy. Use RAID-Z2 or RAID-Z3 if possible.
RAID-Z2
RAID-Z2 is a software implementation of RAID6. It is similar to RAID-Z, but offers two redundant disks.
RAID-Z3
RAID-Z3 is a triple-parity extension of RAID-Z, offering three redundant disks. It is slower than RAID-Z2 and RAID-Z, but is also highly resilient - you can lose up to three disks from the vdev safely.
Disk - not suitable for production
The most basic type of vdev is a single standard block device. This can be an entire disk (such as /dev/sda) or a partition (/dev/sda1). A single-disk vdev is not suitable for production because the loss of that one disk will cause the pool to fail.
However, a disk vdev can be upgraded to a mirror by attaching a second disk of equal or larger size.
File - not suitable for production
In addition to disks, ZFS pools can be backed by regular files; this is especially useful for testing and experimentation. Use the full path to the file as the device path in zpool create. All vdevs must be at least 128 MB in size.
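As a sketch, a throwaway test pool could be built from a couple of sparse files (paths and sizes are arbitrary):
truncate -s 256M /tmp/zfs-file-0 /tmp/zfs-file-1                   # create two sparse backing files
sudo zpool create testpool mirror /tmp/zfs-file-0 /tmp/zfs-file-1  # mirror vdev backed by the files
sudo zpool destroy testpool                                        # tear the test pool down when finished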
Spare
ZFS has a special pseudo-vdev type for keeping track of available hot spares. Note that installed hot spares are not deployed automatically; they must be manually configured to replace the failed device using zpool replace.
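For instance, a spare could be added to a hypothetical pool tank and later swapped in for a failed disk (device names are placeholders):
sudo zpool add tank spare sdl     # register sdl as a hot spare
sudo zpool replace tank sdb sdl   # replace the failed sdb with the spare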
Log
ZFS Log Devices, also known as the ZFS Intent Log (ZIL), move the intent log from the regular pool devices to a dedicated device, typically an SSD. Having a dedicated log device can significantly improve the performance of applications with a high volume of synchronous writes, especially databases. Log devices can be mirrored, but RAID-Z is not supported. If multiple log devices are used, writes will be load balanced across them.
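A mirrored log device could be added to an existing pool like so, assuming two SSDs named sdx and sdy:
sudo zpool add tank log mirror sdx sdy   # dedicated, mirrored log devices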
Cache
Adding a cache vdev to a pool will add the storage of the cache to the L2ARC. Cache devices cannot be mirrored. Since a cache device only stores additional copies of existing data, there is no risk of data loss. Note that having a large cache device can have adverse effects, because files in the cache must be referenced in memory. If the cache is too large, it can reduce the ability to cache data in memory, which will be faster than any cache device.
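Similarly, an SSD could be added as a cache device (device name illustrative):
sudo zpool add tank cache sdz   # L2ARC cache device; cannot be mirrored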
Others
RAID Card Note
Ideally, RAID cards should be replaced with JBOD HBA devices, to allow ZFS full control over the drives. However, if this is not possible for whatever reason, set up each individual disk as its own single-disk RAID 0 volume.
This is sub-par because it prevents ZFS from performing some lower-level drive access, presenting it with a RAID-card abstraction instead, but it can be acceptable if required.
Install Packages
First, install the needed packages from the apt repos.
sudo apt-add-repository universe -y &&
sudo apt update &&
sudo apt install -y zfsutils-linux
To confirm the ZFS package is installed properly, run:
sudo zpool status
This should return no pools available, which confirms that ZFS is installed and no pool is set up.
Find Your Disks
Find all your block devices with lsblk (LiSt BLocK devices):
lsblk
Output will look like:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 86.9M 1 loop /snap/core/4917
sda 8:0 0 136.1G 0 disk
sda1 8:1 0 512M 0 part /boot/efi
sda2 8:2 0 135.6G 0 part /
sdb 8:16 0 2.7T 0 disk
sdc 8:32 0 2.7T 0 disk
sdd 8:48 0 2.7T 0 disk
sde 8:64 0 2.7T 0 disk
sdf 8:80 0 558.4G 0 disk
sdg 8:96 0 558.4G 0 disk
sdh 8:112 0 558.4G 0 disk
sdi 8:128 0 558.4G 0 disk
sdj 8:144 0 558.4G 0 disk
sdk 8:160 0 558.4G 0 disk
The first column is the device name in /dev/*. That is: sde is the device found at /dev/sde on the system.
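On a production system it is often preferable to reference disks by their stable IDs rather than the sdX names, which can change between boots; those IDs can be listed with:
ls -l /dev/disk/by-id/
This guide sticks with the short sdX names for readability.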
In this case, we are setting up five vdevs, each a mirror of two drives:
- mirror 1 (3.0t): sdb, sdc
- mirror 2 (3.0t): sdd, sde
- mirror 3 (0.5t): sdf, sdg
- mirror 4 (0.5t): sdh, sdi
- mirror 5 (0.5t): sdj, sdk
These mirrors are added as top-level vdevs in the pool, which we will call tank, for a total usable space of about 7T.
Pool Setup Basics
With the pool layout and device names found above, we create the pool.
The command syntax is as follows:
- zpool: the name of the command line tool
- create: create a new pool
- [name of pool]: a string value naming the pool. tank is often used as a default value.
- vdev definition block (repeat once per vdev in your pool):
  - [type of vdev]: in our case, we're using mirror. Could also be raidz (RAID 5), raidz2 (RAID 6), raidz3 (triple parity), or some other values.
  - [device names in vdev]: the devices that make up the vdev
Since this may not be clear, here are a few examples of commands and their output:
zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0
Create a pool called tank, with two vdevs. The first vdev is a mirror with devices /dev/c1d0 and /dev/c2d0; the second vdev is a mirror with devices /dev/c3d0 and /dev/c4d0.
zpool create tank raidz2 sdb sdc sdd sde sdf
Create a pool called tank, with one vdev. The vdev is a raidz2 (RAID 6), consisting of /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde, and /dev/sdf.
Create Pool
Using the pool layout and device names found above, we will create a pool called tank with the mirrors listed earlier, using the following command:
sudo zpool create -n tank mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi mirror sdj sdk
The -n flag makes it a dry run. This allows you to see the proposed result of your actions.
The output will look like:
would create 'tank' with the following layout:
tank
mirror
sdb
sdc
mirror
sdd
sde
mirror
sdf
sdg
mirror
sdh
sdi
mirror
sdj
sdk
This is what we expect: 5 pairs of mirrors.
Tweak the syntax with dry-runs as many times as is needed. No changes are made to the disks when run in dry-run mode.
When you are satisfied, remove -n, and the command will actually be executed on the disks.
sudo zpool create tank mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi mirror sdj sdk
If you receive no output, it was likely successful. Confirm by running a few commands to view the status of your pool.
- zpool status - detailed pool information, including disk health
- zfs list - list of all pools and datasets, showing free/used space
- df -h - system-wide disk-space overview; your pool should now be mounted at the root of the filesystem under your pool name (i.e. a pool named tank is mounted by default at /tank).
Create Datasets
Datasets are ways of dividing up a pool. A dataset can be assigned quotas, replicated, configured, mounted, and snapshotted separately from every other dataset.
For example, you might set one dataset for home directories, and another for VM images. Your VM images might be snapshotted nightly, replicated off the host, and stored with compression disabled, whereas your home directories would be snapshotted hourly with a higher compression level enabled.
Each dataset is treated as its own Linux filesystem, but the storage capacity of the entire pool is shared across datasets (unless you apply quotas or reservations to restrict capacity use).
Datasets can also be nested, like standard Linux filesystems.
Create some datasets using the command zfs create [pool]/[datasetname]. Note that datasets can be nested, but are not created recursively. That is: you cannot create tank/system/homedirs until you have created tank/system.
sudo zfs create tank/vms
sudo zfs create tank/vms/installers
sudo zfs create tank/vms/disks
sudo zfs create tank/system
sudo zfs create tank/system/homedirs
This creates datasets for VM install images, VM disks, and the home directories.
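As a sketch of the per-dataset tuning described earlier, properties such as compression or quotas could then be set (values here are illustrative):
sudo zfs set compression=lz4 tank/system/homedirs   # enable compression for home directories
sudo zfs set quota=500G tank/vms/disks              # cap the VM disk dataset at 500G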
After creating your datasets, look again using the tools listed above (df -h, zfs list, etc.), and you'll see your datasets listed there.
Mount After Reboot
Important! By default, ZFS does not mount its pools on boot. This must be handled using a tool or another method.
With ZFS on Linux, the easiest way to do this is typically the zfs-import-scan service.
Before you reboot, enable and start the service so the pool is imported automatically at boot.
sudo service zfs-import-scan start
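On systemd-based releases such as Ubuntu 18.04, it is also worth enabling the unit so that it runs on every boot:
sudo systemctl enable zfs-import-scan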
Or, you can import a pool manually:
sudo zpool import [poolname]