For a while now I have been considering moving my data off of cloud services such as CrashPlan and Dropbox and putting it back onto hardware that I own and fully control. Cloud services typically solve all of the hard problems for you (data integrity, availability, etc.), but this comes at the cost of trusting them with your data.
Recently I've decided that I'd rather not trust my data to third parties unless absolutely necessary, and instead of spending money on third-party services, I have opted to invest the money in my own hardware and migrate my data back under my full control.
Although I have been wanting to do this for a while, I had not found a way of storing my data in a fault-tolerant fashion that would rival the measures used by third-party cloud services. Whilst RAID is commonly used for fault-tolerant data storage, I have heard too many bad stories from others about how RAID can go wrong.
I've finally found the solution: ZFS.
ZFS is a combined file system and logical volume manager that was originally developed by Sun Microsystems for use on Solaris. Since then ZFS has been ported to work on various Linux distributions, including Debian/Ubuntu, Red Hat and CentOS.
ZFS has many strengths which include:
- Every block of data read by ZFS (including the table of block pointers) is checksummed and, where redundancy is available, can be recovered if an error is found
- ZFS can periodically check (scrub) the entire file system for any data corruption which may have occurred since the data was written
- A ZFS storage pool can easily be expanded by adding more drives
- Uses in-line compression to save storage space and increase write speed
- Uses block de-duplication
- Is a 128-bit filesystem, which allows for maximum file sizes of 16 exabytes and a maximum volume size of 256 zettabytes
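Two of the features above, in-line compression and the periodic integrity check, map onto ordinary zfs and zpool commands. The following is a minimal sketch rather than a recommended configuration: it assumes a pool named datastore (the name used later in this post), picks LZ4 as the compression algorithm, and wraps the two commands in a hypothetical helper called tune_pool. Both commands need root privileges.

```shell
# Hypothetical helper: turn on compression and start an integrity check
# (a "scrub") for the pool named in $1.
# Assumptions: the pool already exists and the caller is root.
tune_pool() {
    pool="$1"
    # Compress newly written blocks with LZ4; data already on disk is left as-is.
    zfs set compression=lz4 "$pool"
    # Read every block in the pool and verify it against its checksum.
    zpool scrub "$pool"
}

# Example (not run here): tune_pool datastore
```

Progress of a running scrub appears on the scan line of zpool status.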
In ZFS, disks can be combined into virtual groups and each group can be configured to use one of the following redundancy options:
- Mirror: Data from one disk is mirrored to another; this is the equivalent of RAID 1. Although this provides the best redundancy, it does require the most space. If you had 2 x 4TB drives, Mirror gives you 4TB of usable space.
- Stripe: Data is written across disks; this is the equivalent of RAID 0. If we assume a two-disk stripe array, the filesystem would be split between the two disks, with roughly half of the data on each. Although this offers the best read/write performance, there is no redundancy. If a drive fails you lose the data that lives on the failed disk, which in practice means losing the whole pool.
- RAID-Z: One disk's worth of space is used for parity and the rest for data storage; this is the equivalent of RAID 5. RAID-Z requires a minimum of 3 drives and allows for a single drive failure. If more than one drive fails, all data on the filesystem will be lost.
- RAID-Z2: This is the same as RAID-Z but requires a minimum of 4 drives, 2 of which are used for parity, so up to 2 drives may fail.
- RAID-Z3: This is the same as RAID-Z but requires a minimum of 5 drives, 3 of which are used for parity, so up to 3 drives may fail.
For my use case I opted for RAID-Z on an array of 4 x 4TB drives. This gives me a usable filesystem of roughly 12TB and allows for 1 drive to fail.
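The capacity figures quoted for each option follow from simple arithmetic: a mirror yields the size of one disk, and the striped and RAID-Z layouts yield the number of disks minus the parity disks, multiplied by the disk size. Here is a rough sketch of that calculation. The function name usable_tb and its arguments are made up for illustration, it assumes all disks are the same size, and it ignores filesystem overhead, so real-world numbers will come out somewhat lower.

```shell
# usable_tb LAYOUT DISKS SIZE_TB
# Rough usable capacity in TB for identically sized disks.
# Hypothetical helper; ignores metadata and padding overhead.
usable_tb() {
    layout="$1"; disks="$2"; size="$3"
    case "$layout" in
        mirror)       echo "$size"; return ;;  # every disk holds the same data
        stripe)       parity=0 ;;
        raidz|raidz1) parity=1 ;;
        raidz2)       parity=2 ;;
        raidz3)       parity=3 ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
    echo $(( (disks - parity) * size ))
}

usable_tb raidz 4 4    # prints 12 (the array used in this post)
usable_tb mirror 2 4   # prints 4
```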
Installing ZFS on Ubuntu 14.04
Firstly, install ZFS:
$ apt-add-repository --yes ppa:zfs-native/stable
$ apt-get update
$ apt-get install ubuntu-zfs
During the installation process ZFS has to build and install kernel modules to provide support for ZFS. Test that ZFS is working by running:
$ dmesg | grep ZFS
You should receive an output that looks similar to this:
[ 14.793587] ZFS: Loaded module v0.6.4.1-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
If you get a blank response, try loading the kernel module again (typically with modprobe zfs) and re-test.
Creating a RAID-Z disk array
When you create a RAID-Z disk array, you ideally want to use disks that are all the same size, otherwise the smallest disk size will be used across all drives. In my example I am using 4 x 4TB drives.
Use fdisk to determine which drives you want to add to your array. In my case I want to add /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd.
We will now create a zpool called datastore using the drives found in the previous step:
$ zpool create -f datastore raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
Confirm that the zpool has been created by running:
$ zpool status datastore
The output should be similar to this:
  pool: datastore
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    datastore   ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
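Once the pool is in use, the STATE column is the part of this output worth keeping an eye on. As a rough sketch, the hypothetical filter below reads zpool status output on standard input and prints every entry in the config table whose state is not ONLINE. It is a simplification that assumes the column layout shown above; it would also list hot spares, which report a state of AVAIL.

```shell
# Hypothetical filter: read `zpool status` output on stdin and print the name
# and state of every config entry whose STATE column is not ONLINE.
# Assumes the NAME/STATE/READ/WRITE/CKSUM table layout shown above.
unhealthy_devices() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line follows the table.
        /^errors:/ { in_config = 0 }
        # Inside the table, report anything that is not ONLINE.
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Example (not run here): zpool status datastore | unhealthy_devices
```

No output means every entry in the table reported ONLINE.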
Creating a dataset
So far we have installed ZFS and created a zpool that spans 4 drives using RAID-Z. We now need to make the storage accessible. For my use case I am going to create a single dataset called data in the datastore pool, and this dataset will be mounted at /mnt/data:
$ zfs create -o mountpoint=/mnt/data datastore/data
Finally, check that the dataset has been set up correctly, then we're ready to start storing data!
$ zfs list
The output should look similar to:
NAME             USED  AVAIL  REFER  MOUNTPOINT
datastore        312K    14T  38.6K  /datastore
datastore/data  38.6K    14T  38.6K  /mnt/data
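If you would rather script this check than read the listing by eye, the mounted property of the dataset can be queried directly. This is only a sketch: the helper name dataset_mounted is made up, and it assumes the zfs command is on the PATH and that the dataset name passed in exists.

```shell
# Hypothetical helper: succeed if the given dataset is currently mounted.
# -H drops the header row and -o value prints only the property value,
# so the command prints a bare "yes" or "no".
dataset_mounted() {
    [ "$(zfs get -H -o value mounted "$1")" = "yes" ]
}

# Example (not run here):
#   dataset_mounted datastore/data && echo "datastore/data is mounted"
```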