There is a special issue when using ZFS-backed NFS for a Datastore under ESXi.
The problem is that the ESXi NFS client forces a commit/cache flush after every write. This makes sense in the context of what ESXi does, as it wants to be able to reliably inform the guest OS that a particular block was actually written to the underlying physical disk. However, with ZFS, every one of those synchronous writes and cache flushes lands in the ZIL (ZFS Intent Log).
The end result is that the ZFS array ends up doing a massively disproportionate amount of writing to the ZIL and throughput suffers (I was seeing under 1 MiB/sec on Gigabit Ethernet!).
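You can confirm it is the commit storm (and not something else) by glancing at the NFS server counters on the FreeBSD box before and after a burst of VM I/O; this is just a quick diagnostic sketch:

# Server-side NFS statistics (-e = extended/new NFS server stats, -s = server only).
# If the Commit count climbs nearly as fast as the Write count, the client is
# flushing on essentially every write.
nfsstat -e -s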
Performance Benchmarking
Here are the results of testing the various work-arounds; as you can see, modifying the kernel is the clear winner. It also has minimal side effects compared to the other options.
Method | Read Speed | Read Latency | Write Speed | Write Latency |
NFS Kernel Mod | 67 MiB/sec | 341 ms | 110 MiB/sec | 153 ms |
zfs set sync=disabled | 69 MiB/sec | 198 ms | 69 MiB/sec | 1628 ms |
vfs.zfs.cache_flush_disable="1" | 67 MiB/sec | 760 ms | 16 MiB/sec | 1543 ms |
* Tested with dedicated 1 Gbit Ethernet interconnect.
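If you want to sanity-check numbers like these yourself, a crude sequential test from any box with the export mounted will get you in the ballpark; the mount point and sizes below are placeholders, not the exact benchmark used above:

# Rough sequential write then read against an NFS-mounted datastore.
# /mnt/datastore is a placeholder for wherever the export is mounted.
dd if=/dev/zero of=/mnt/datastore/testfile bs=1m count=2048
dd if=/mnt/datastore/testfile of=/dev/null bs=1m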
Here are the four solutions:
IDEAL: Hack the NFS Subsystem
This makes the kernel ignore NFS clients’ requests to commit to stable storage, so ESXi’s (or any other NFS client’s) commit/cache-flush requests are never passed along to the file system.
This, in my view, is the ideal. If you have UPS power there is very very little risk here.
Per this article we’re going to modify nfs_nfsdport.c: http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html
vi /usr/src/sys/fs/nfsserver/nfs_nfsdport.c
Search for NFSWRITE_UNSTABLE and find this block:
if (stable == NFSWRITE_UNSTABLE)
    ioflags = IO_NODELOCKED;
else
    ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
And change it to:
// if (stable == NFSWRITE_UNSTABLE)
    ioflags = IO_NODELOCKED;
// else
//     ioflags = (IO_SYNC | IO_NODELOCKED);
uiop->uio_resid = retlen;
uiop->uio_rw = UIO_WRITE;
Then recompile the kernel, and remember that this needs to be re-done after running freebsd-update or otherwise updating /usr/src.
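The rebuild itself is the standard FreeBSD kernel dance; substitute your own kernel config name if you don’t run GENERIC:

# Rebuild and install the patched kernel, then reboot onto it.
cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now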
The Other Options
There are other solutions, and for completeness’ sake here they are (and why I think the above solution is better):
SSD ZIL Disks
For this you optimally want two SSDs (mirrored for redundancy) to host the ZIL, instead of keeping it on the array disks themselves.
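For reference, attaching a mirrored log device to an existing pool looks something like this; the pool and device names are placeholders for your own:

# Add two SSDs as a mirrored ZIL (SLOG) device to pool "tank".
# "ada4" and "ada5" are placeholders for your SSD device nodes.
zpool add tank log mirror ada4 ada5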
Especially when you consider that writing is what wears out SSDs, I think this is a poor solution: there will still be a huge number of excess writes, they’ll just be faster.
Disable the ZIL Entirely
This is a pretty blunt solution, but a quick and easy temporary fix. Run this against the pool (or a specific dataset):
zfs set sync=disabled zroot
This turns off sync forcing/cache flushing for the entire pool. There are some who cry wolf and say this can lead to underlying ZFS corruption, but per this article I do not believe that is the case: https://blogs.oracle.com/roch/entry/nfs_and_zfs_a_fine
What it does say is that you can end up with NFS client corruption (in the form of inconsistency). This may be so, but remember that the guest filesystem itself (i.e. NTFS or UFS) also has protections built in which can help mitigate these things.
And of course if everything is UPS backed (and nothing panics) this is even less of an issue.
I used this method temporarily until I made the NFS change and experienced no problems, but I dislike how this affects “everything” including native writes, Samba, etc.
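If you do go this route, you can at least limit the blast radius by setting the property only on the dataset backing the NFS export rather than on the whole pool; the dataset name below is just an example:

# Disable sync semantics only for the dataset exported to ESXi
# ("zroot/esxi" is a placeholder); everything else keeps the default.
zfs set sync=disabled zroot/esxi
# Confirm what is affected:
zfs get -r sync zroot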
Setting vfs.zfs.cache_flush_disable="1" in /boot/loader.conf
This, I think, is an older “solution” from the 8.x days, and the sync=disabled option supersedes it. I found that while it did improve performance by a factor of 15x, that only meant 15 MiB/sec writes, which I consider still unacceptable. And the “risks” are similar to those of sync=disabled, which has much better performance.
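For completeness, the tunable goes in /boot/loader.conf and only takes effect after a reboot:

# /boot/loader.conf
vfs.zfs.cache_flush_disable="1"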
For ZIL, if you’re worried about SSD failures (although that’s why you use two…) you can use these: http://www.hgst.com/solid-state-storage/enterprise-ssd/sas-ssd/zeusram-sas-ssd Rather expensive, but they work very well.
Very slick, thanks!
Adam, I wonder if I understood this at all 🙂
I guess my setup is the other way around: a Linux storage box (NFSv4 server), a Xen dom0 server (CentOS) and a FreeBSD VM. This FreeBSD VM has a UFS2 root and a ZFS mount where the active jail is hosted.
Am I right that it should be no problem to disable the ZIL (sync), since NFS is not the layer writing into ZFS, so there is no risk of data loss from this point?
Would be great to get your view about this!
Hey Jimmy,
My experience with the issue is specific to FreeBSD as the NFS server with ZFS, but as you may gather the underlying issue is caused by ESXi triggering the “flush” action when writing to the NFS server.
Xen likely does the same, however Linux’s (in your case CentOS) ext3/ext4 file systems don’t react to this as severely as ZFS does. Nor does FreeBSD’s UFS (the ‘native’ file system of FreeBSD), which is ultimately what you’re writing to, correct?
That being said, how are you doing ZFS inside Xen? Virtual disks on the NFS server, or pass-through directly to devices? Can you show me the ‘zpool status’ output from the FreeBSD server?
Adam:
Thanks a lot for your article. We changed the file and recompiled FreeNAS 9.10. We’re now getting 90+ MB/s transfer speeds with ESXi.