I've been considering moving my data off of cloud services such as CrashPlan and Dropbox and putting it back onto hardware that I own and fully control. In my new setup I have opted to store my data in a fault-tolerant manner using ZFS.

For a while now I have been considering moving my data off of cloud services such as CrashPlan and Dropbox and putting it back onto hardware that I own and fully control. Cloud services typically solve all of the hard problems for you (data integrity, availability, etc.), but this comes at the cost of trusting them with your data.

Recently I've decided that I'd rather not trust my data to third parties unless absolutely necessary, so instead of spending money on third-party services, I have opted to invest that money in my own hardware and migrate my data back under my full control.

Although I have been wanting to do this for a while, I had not found a way of storing my data in a fault-tolerant fashion that would rival the measures used by third-party cloud services. Whilst RAID is commonly used for fault-tolerant data storage, I have heard too many bad stories from others of how RAID can go wrong.

I've finally found the solution: ZFS.

ZFS is a combined file system and logical volume manager that was originally developed by Sun Microsystems for use on Solaris. Since then ZFS has been ported to various Linux distributions, including Debian/Ubuntu, Red Hat and CentOS.

ZFS has many strengths (a few of which are sketched as example commands after this list), including:

  • Every block of data read by ZFS (including the table of block pointers) is checksummed and can be recovered if an error is found
  • ZFS periodically checks the entire file system (a "scrub") for any data corruption which may have occurred since the data was written
  • A ZFS storage pool can easily be expanded by adding more drives
  • Uses in-line compression to save storage space and increase write speed.
  • Uses block de-duplication
  • Is a 128-bit filesystem, which allows for a maximum file size of 16 exabytes and a maximum volume size of 256 zettabytes.
  • and more...
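
To give a flavour of how a few of these features are used day to day, here is a minimal sketch of the relevant commands. It assumes the pool named datastore and the dataset datastore/data that we create later in this post; the extra /dev/sdX device names are just placeholders:

# Manually trigger an integrity check (scrub) of the whole pool
$ zpool scrub datastore

# Check on the progress and result of the scrub
$ zpool status datastore

# Expand the pool later by adding another group of drives (placeholder devices)
$ zpool add datastore raidz /dev/sde /dev/sdf /dev/sdg /dev/sdh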

In ZFS, disks can be combined into virtual groups (vdevs), and each group can be configured to use one of the following redundancy options (example zpool create commands for each layout are sketched after this list):

  • Mirror
    Data from one disk is mirrored to another; this is the equivalent of RAID 1. Although this provides the best redundancy, it requires the most space. If you had 2 x 4TB drives, a mirror gives you 4TB of usable space.

  • Stripe
    Data is written across the disks; this is equivalent to RAID 0. If we assume a two-disk stripe array, the filesystem would be split between the two disks, with half of the data on one disk and half on the other. Although this offers the best read/write performance, there is no redundancy: if a drive fails you lose the portion of the file system that lives on the failed disk.

  • RAID-Z
    One disk's worth of space is used for parity and the rest for data storage; this is equivalent to RAID 5. RAID-Z requires a minimum of 3 drives and tolerates a single drive failure. If more than one drive fails, all data on the filesystem will be lost.

  • RAID-Z2
    This is the same as RAID-Z but requires a minimum of 4 drives and uses 2 drives' worth of space for parity, so up to 2 drives may fail.

  • RAID-Z3
    This is the same as RAID-Z but requires a minimum of 5 drives and uses 3 drives' worth of space for parity, so up to 3 drives may fail.
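
To make these options a little more concrete, the sketch below shows roughly how each layout maps onto the zpool create command. The pool name tank and the /dev/sdX devices are placeholders, not part of my actual setup:

# Mirror (RAID 1 equivalent): two drives holding identical copies
$ zpool create tank mirror /dev/sda /dev/sdb

# Stripe (RAID 0 equivalent): data spread across the drives, no redundancy
$ zpool create tank /dev/sda /dev/sdb

# RAID-Z: single parity, minimum of 3 drives
$ zpool create tank raidz /dev/sda /dev/sdb /dev/sdc

# RAID-Z2: double parity, minimum of 4 drives
$ zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAID-Z3: triple parity, minimum of 5 drives
$ zpool create tank raidz3 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde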

For my use case I opted for RAID-Z with a 4 x 4TB drive array. This gives me a usable capacity of around 12TB and allows for 1 drive to fail.

Installing ZFS on Ubuntu 14.04

Firstly, install ZFS:

$ apt-add-repository --yes ppa:zfs-native/stable
$ apt-get update
$ apt-get install ubuntu-zfs

During the installation process the package builds and installs the kernel modules that provide ZFS support. Test that ZFS is working by running:

$ dmesg | grep ZFS

You should see output that looks similar to this:

[   14.793587] ZFS: Loaded module v0.6.4.1-1~trusty, ZFS pool version 5000, ZFS filesystem version 5

If you get a blank response, try loading the module manually by running the following and then re-test:

$ modprobe zfs
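
If the module still isn't loaded after a reboot, one option (this is my own suggestion rather than part of the official instructions) is to have it loaded automatically at boot by adding it to /etc/modules:

# Load the zfs module at boot (run as root)
$ echo zfs >> /etc/modules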

Creating a RAID-Z disk array

When you create a RAID-Z disk array, you ideally want to use disks that are all the same size, otherwise only the capacity of the smallest disk will be used on each drive. In my example I am using 4 x 4TB drives.

Use fdisk to determine which drives you want to add to your array. In my case I want to add /dev/sda, /dev/sdb, /dev/sdc, /dev/sdd.
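
For example, either of the following gives a quick overview of the disks in the system (the grep pattern just trims fdisk's output down to the summary line for each disk):

# Show just the size/summary line for each disk
$ fdisk -l | grep '^Disk /dev/sd'

# lsblk gives a similar overview of drives and any existing partitions
$ lsblk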

We will now create a zpool called datastore using the drives found in the previous step:

$ zpool create -f datastore raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd

Confirm that the zpool has been created by running:

$ zpool status datastore

The output should be similar to this:

  pool: datastore
 state: ONLINE
  scan: none requested
config:

    NAME                                          STATE     READ WRITE CKSUM
    datastore                                     ONLINE       0     0     0
      raidz1-0                                    ONLINE       0     0     0
        sda                                       ONLINE       0     0     0
        sdb                                       ONLINE       0     0     0
        sdc                                       ONLINE       0     0     0
        sdd                                       ONLINE       0     0     0
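
zpool status reports the health of each device in the pool. If you also want a quick summary of the pool's overall size and usage, zpool list is worth running as well (an extra check, not strictly required):

$ zpool list datastore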

Creating a dataset

So far we have installed ZFS and created a zpool that spans 4 drives using RAID-Z. We now need to make the storage accessible. For my use case I am going to create a single dataset called data in the pool, and this dataset will be mounted at /mnt/data:

$ zfs create -o mountpoint=/mnt/data datastore/data
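
Since in-line compression and de-duplication were among the reasons for choosing ZFS, this is a sensible point to enable them on the new dataset if you want them. Treat the following as a suggestion rather than part of the core setup (de-duplication in particular can use a lot of memory):

# Enable in-line lz4 compression on the dataset
$ zfs set compression=lz4 datastore/data

# Optionally enable block de-duplication (memory hungry, use with care)
$ zfs set dedup=on datastore/data

# Confirm that the properties have been applied
$ zfs get compression,dedup datastore/data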

Finally, test that the dataset has been set up correctly, and then we're ready to start storing data!

$ zfs list

The output should look similar to:

NAME             USED  AVAIL  REFER  MOUNTPOINT
datastore        312K  14T    38.6K  /datastore
datastore/data   38.6K 14T    38.6K  /mnt/data