My 2023 all-flash ZFS NAS (Network Storage) build

Am I misreading or did he create a non-redundant pool spanning the 2 SSD drives? I don't think scrubbing will keep him from losing data when one of those drives fails or gets corrupted.

Edit: Looked again and he's getting redundancy by running multiple NASes and rsyncing between them. Still seems like a risky setup though.

throw0101c · 2 years ago

In the just-released 2.2.0 you can correct blocks from a remote backup copy:

> Corrective "zfs receive" (#9372) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.

* https://github.com/openzfs/zfs/releases/tag/zfs-2.2.0

* https://github.com/openzfs/zfs/pull/9372

willglynn · 2 years ago

-- if using `zfs send` to create the replica, which the author of the post is not.

insaneirish · 2 years ago

> I don't think scrubbing will keep him from losing data when one of those drives fails or gets corrupted.

It pains me to see a ZFS pool with no redundancy because instead of being "blissfully" ignorant of bit rot, you'll be alerted to its presence and then have to attempt to recover the files manually via the replica on the other NAS.

I appreciate that the author recognizes what his goals are, and that pool level redundancy is not one of them, but my goals are very different.

Fnoord · 2 years ago

In my setup I combine one machine with ZFS with another running btrsfs, therefore using rsync/restic. And Apple devices uses APFS. I'd rather use ZFS solely (or in future: bcachefs) but unfortunately Apple went w/their native next-gen CoW filesystem and I don't think ZFS is available for Synology DSM. Though perhaps I should simply replace DSM with something more my liking (Debian-based, Proxmox).

prmoustache · 2 years ago

On the other hand, I myself use a similar approach. For home purpose I prefer having redundancy across several cheaper machines without raid1,10,5,6,z than a more expensive single machine with disk redundancy. And I'd rather pay the additionnal money on putting ECC ram on all machines.

Dead Comment

Too · 2 years ago

For home-use this is a reasonable tradeoff. Imagine my non tech-savvy wife for some reason has to get access to the data when one of the NAS has malfunctioned because the zfs pool encryption was badly configured. Explaining "zfs receive" or let alone what zfs, linux and ssh is going to be grounds for divorce. Heck, i don't even want to read man pages about zfs myself on weekends, i have enough such problems at work. Besides, you still want 2 physical locations.

It's going to be less optimal and less professional. That's ok, for something as important as backups, keep it boring. Simply starting up the second box is simple, stupid and has well-understood failure modes. Maybe someone like me should just buy an off-the shelf NAS.

eternityforest · 2 years ago

That seems like a pretty reasonable plan actually. If I ever do a NAS I was thinking to have one disk for storing files, and a pair of of disks for backing up both the first disk and my laptop.

That way everything has 3 copies, but I'm not backing up a compressed backup that might not deduplicate well and might be a little excessive to have 4 copies.

dpedu · 2 years ago

It wasn't mentioned in the blog, but you can set `copies=N` on ZFS filesystems and ZFS will keep N copies of user data. This provides redundancy that protects against bitrot or minor corruption, even if your zpool isn't mirrored. Of course, it provides no protection against complete drive failure.

j45 · 2 years ago

Curious if something like Ceph would help.

magicalhippo · 2 years ago

At a massive performance cost.

KaiserPro · 2 years ago

ceph is never the answer. Its a nice idea, but in practice its just not that useful.

if you want speed: lustre

if you want anything else: gpfs.

Dead Comment

I did something similar this year with 12x 4tb NVMe drives. I had a 12x 2tb setup is an fashstor 12 but the CPU and RAM limitations led me to traditional RAID there. For the second NAS I just went standard parts but with extreme performance, 13900K and the like. To get the necessary PCIe lanes (since not every port would bifurcate) I used 3 4 way switches from aliexpress. Choosing a motherboard with 2 x8 cpu slots made the bandwidth not a problem, the third does hang off the demo 4.0 lanes though. The boot drive hangs directly off the cpu in the x4 slot. The 13900K actually supports ECC in the workstation motherboards as well. I just expose everything via NFS and call it a day

All in all I'm happy with it. I still use spinning rust for the backup though, no sense worrying about that. At the time it was cheaper to get 12x4 TB budget NVMe and the extra parts than go for any reasonable count of SATA SSDs, not sure if that's still true or not. SATA definitely would have been easier, e.g. stuff like in the Flashstor 12x2TB build I had to submit a kernel patch because the cheap drives were duplicating their nsids.

magicalhippo · 2 years ago

> the CPU and RAM limitations led me to traditional RAID there

IIRC there was some talk that RAID-Z lead to write amplification compared to mirrored drives, thus not good for cheaper SSDs. Haven't had time to sit down and think it through, does anyone know if thats right or wrong?

> I used 3 4 way switches from aliexpress

Did you find any that were significantly cheaper than ~100 USD?

zamadatix · 2 years ago

In general I've found cheap SSDs these days have better endurances compared to cheap spinning disks anyways. https://www.servethehome.com/discussing-low-wd-red-pro-nas-h.... I think the drives I got were 1.6 PBW rated, no errors or failures yet. Beyond that, my workload isn't write heavy enough for this to be a concern of mine whether or not there is amplification. That SSDs don't count reads or active drive time in the endurance rating was more important. For the Flashstor build I did RAID4 instead of RAID5 just because I could though.

I want to say that sounds about right on price. If I didn't also have a desire for high single core performance at the same time a used/old Threadripper/Epyc build and bifurcation would probably make more sense. I also disconnected the onboard tiny "definitely going to be noisy as hell in 3 months" low quality fans from them and just rested a 140mm blowing down across the top of the 3 cards at 30% speed. Temps of the controllers and SSDs became better and it's dead silent.

gigel82 · 2 years ago

Can you expand more on the hardware used? Specifically, which 4 way switches you used?

I have a bunch of unused M.2 drives from decommissioned servers that I'd love to use as additional storage.

zamadatix · 2 years ago

I started with this page for adapters people had tried and recommended here https://forums.servethehome.com/index.php?threads/multi-nvme... but I think I ultimately went with the LinkReal x8 to quad.

If you don't specifically want the high per core performance of something like a 13900k a used/old Threadripper or Epyc system and bifurcation might make more sense. It'll also enable you to get maximum per drive bandwidth, if that's a concern for your (when you have 12 drives in some form of stripe and parity the per drive bandwidth ends up not being that important for most sane workloads though).

robbintt · 2 years ago

What brand of 4tb nvmes are you using?

zamadatix · 2 years ago

I scooped up some MP34s on sale. The have something like a 1.6 PBW endurance rating and I haven't had a single one fail or start throwing errors yet (though my workload isn't as write heavy as many might have).

quux · 2 years ago

praseodym · 2 years ago

I noticed that the author is using ZFS-native encryption, which in my experience is not particularly stable. I've even managed to corrupt an entire pool with ZFS send/receives when trying to make backups. I'd strongly recommend using ZFS-on-LUKS instead if encryption is required.

The list of open issues on GitHub in the native encryption component is quite telling: https://github.com/openzfs/zfs/issues?q=is%3Aopen+is%3Aissue...

Modified3019 · 2 years ago

I have to agree, the leadership’s response and handling of errors in regards to encryption has been… honestly pretty disappointing to see as a zfs fan and user.

While it’s not common, the amount of people running into edge cases and killing both their sending and sometimes even receiving pools (better have another backup!) is frankly unacceptable. “Raw sends” seem to be especially at risk, though sending in general seems to be where the issues mostly lay. My thoughts mirror the comments here: https://discourse.practicalzfs.com/t/is-native-encryption-re...

Here’s a currently unmaintained document by Rincebrain that he was using to try to track things before he got burned out by the lack or response. https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...

didntcheck · 2 years ago

I was quite surprised when I learned that ZFS-native encryption is so underdeveloped, and that the developers still sometimes seem to forget about it when coding new features. I had presumed it would be table stakes for enterprise storage by now. Is it just the case that everyone uses it on GELI or LUKS, or am I wrong to presume that encryption is as widespread in enterprise deployments as I thought?

It's a shame, because the main draw of the native encryption to me is being able to have a zero-trust backup target with all the advantages of zfs send over user-level backup software. But I've heard of several people running into issues doing this (though luckily no actual data loss that I've heard, just headaches)

sneak · 2 years ago

I think large enterprises usually do encryption on the application level and not storage level these days. A KMS or encryption service, then storage services just hold encrypted blobs.

I run the 50 zfs disks in my house on LUKS.

aborsy · 2 years ago

I use it with snapshots and ZFS send, and so far it’s been fine. Truenas users also may use it.

But I’m concerned reading these comments. Anyone else experiencing issues?

secabeen · 2 years ago

The fundamental issue with ZFS encryption is that the primary developer that created it is no longer contributing significantly to the project. It's good code, with good tests, but it's not getting any additional love.

The utilities and tooling surrounding encryption are also weak, and there are ways you can throw away critical, invisible keydata without realizing it, and no tool to allow the correction of the issue, even if you have the missing keydata on another system.

I've been using ZFS with native encryption (Ubuntu Server) but also ZFS with LUKS (Arch, Ubuntu Desktop). Zero issues (though inability to run latest kernel can be annoying, esp on rolling distro). Wouldn't surprise me if write cache has a role in this issue though.

At EuroBSDcon 2022, Allan Jude gave the presentation "Scaling ZFS for NVMe":

> Learn about how ZFS is being adapted to the ways the rules of storage are being changed by NVMe. In the past, storage was slow relative to the CPU so requests were preprocessed, sorted, and coalesced to improve performance. Modern NVMe is so low latency that we must avoid as much of this preprocessing as possible to maintain the performance this new storage paradigm has to offer.

> An overview of the work Klara has done to improve performance of multiple ZFS pools of large numbers of NVMe disks on high thread count machines.

[…]

> A walkthrough of how we improved performance from 3 GB/sec to over 7 GB/sec of writes.

* https://www.youtube.com/watch?v=v8sl8gj9UnA

When ZFS was created spinning rust was still the main thing, and SSDs were gaining popularity, so ZFS created "hybrid storage" pools:

* https://cacm.acm.org/magazines/2008/7/5377-flash-storage-mem...

* https://web.archive.org/web/20080613124922/http://blogs.sun....

* https://web.archive.org/web/20080615042818/http://blogs.sun....

* https://www.brendangregg.com/blog/2009-10-08/hybrid-storage-...

* https://en.wikipedia.org/wiki/Hybrid_array

Still useful for use cases that lean towards bulk storage (versus IOps).

NKosmatos · 2 years ago

Oh I remember this guy! Michael has a nice blog where he writes interesting posts about hardware, networks and all sort of computer stuff. Sometimes I'm feeling a little bit jealous with all these "toys" he has, but he writes them in a good way (without wanting to show off). If you haven't seen it, read about when he upgraded his internet to 25 Gbit/s fiber: https://michael.stapelberg.ch/posts/2022-04-23-fiber7-25gbit...

Coming back on topic, the idea of having 3 custom made NASes is surely interesting, but apart from learning and experimenting with all these, I don't see a very big advantage from backup/security point of view compared to commecrial NASes (Synology, QNAP).

For sure we can all argue here about selection of filesystem (ZFS, btrfs, ext4...), selection of CPU, RAM type and all the others, but it all boils down to what each one wants (and has the money to spare). IMHO I wouldn't go with QVO SSDs and non redundant volumes (especially spanning 2 disks), but hey that's just me :-)

moondev · 2 years ago

Has anyone built or attempted to build their own flash array, as in 20+ nvme u.2 disks in a hot swap chassis and consume via nvme-of or nvme-tcp

I see there are tri-mode backplanes but looking for something that skips the hba and connects via something like occulink directly?

rtp4me · 2 years ago

We are currently testing a number of systems with 12x 30TB NVMe drives with Debian 12 and ZFS 2.2.0. Each of our systems have 2x 128G EPYC CPUs, 1.5TB of RAM, and a dual-port 100GbE NICs. These systems will be used to run KVM VMs plus general ZFS data storage. The goal is to add another 12x NVMe drives and create an additional storage pool.

I have spent an enormous amount of time over the past couple of weeks tuning ZFS to give us the best balance of reads-vs-writes, but the biggest problem is trying to find the right benchmark tool to properly reflect real-world usage. We are currently using FIO but sheer number of options (depth queue, numjobs, libaio vs io_uring) makes the tool unreliable.

For example, comparing libaio vs io_uring with the same options (numbjobs, etc) makes a HUGE different. In some cases, io_uring gives us double (or more) performance than libaio, however, io_uring can produce numbers that don't make any sense (eg: 105GB/sec reads for a system that maxes out at 72B/sec). That said, we were able to push > 70GB/secs large-block reads (1M) from 12x NVMe drives which seems to validate ZFS can perform well on these servers.

OpenZFS has come a long way from the 0.8 days, and the new O_DIRECT option coming out soon should give us even better performance for the flash arrays.

mgerdts · 2 years ago

If you are seeing unreasonably fast read throughput, it is likely that reads are being served from the ARC. If your workload will benefit from the ARC, you may be seeing valid numbers. If your workload will not benefit from the ARC, set primarycache=metadata on the dataset and rerun your test, potentially with a pool export/import or reboot to be sure the cache is cleared.

The fact that fio has a bunch of options doesn’t make the tool unreliable. Not understanding the tool or what you are testing makes you unreliable as a tester. The tool is reliable. As you learn you will become a more reliable tester with it.

ewwhite · 2 years ago

There are definitely better ways to benchmark.

I design similar NVMe-based ZFS solutions for specialized media+entertainment and biosciences workloads and have put massive time into the platform and tuning needs.

Also think about who will be consuming data. I've employed the use of an RDMA-enabled SMB stack and client tuning to help get the best I/o characteristics out of the systems.

_zoltan_ · 2 years ago

I did but bought a full box, didn't assemble myself.

temp0826 · 2 years ago

Why disable swap? And...only 8gb of ram (am I misreading that?) with zfs? It's been years since I've needed a bunch of storage but I remember zfs being much happier with gobs of ram.

ZFS doesn’t need much ram unless you have specific needs. People have run basic zfs storage machines on 4GB of memory total and it works fine. Even 2GB has been done before, but that’s a bit too low for me to suggest. Broadly, 16GB of total system is personal general recommended baseline, because honestly it’s simply too cheap not to have it.

ZFS needs some amount of ram in order to even load the pool, this only becomes a practical concern when your pool gets into the 100’s of TiB.

Deduplication, which should generally not be used, used to be awful with ram consumption, but these days isn’t nearly as bad and can be surprisingly viable. Dedupe can still cause unexpected performance issues unless your skillset is into digging into system analysis and tuning.

You need some amount of RAM buffer for the write TXGs to coalesce efficiently. Generally not a concern.

Finally there’s ARC, which is where all the nice things that improve your experience happen. The more the better, but just like system ram, once you have enough for your usage profile, you stop noticing much benefit beyond that. For dumb file storage, not that much is needed. Ideally you want enough to keep all the metadata, plus whatever your actual repeated read access would be. Working on video editing and VMs would require more RAM for a more optimal experience.

Like I mentioned it's been a long time since I touched this stuff. I do think I set up my (underpowered) system aggressively with ARC now that you mention it. Probably was experimenting with dedup as well. I'm sure some of those things have had improvements over the years.

squarefoot · 2 years ago

I'm using XigmaNAS with two zfs pools and only 4G RAM since like 8 years, also on a seemingly underpowered machine (Atom D410). Granted that it doesn't fly, but for home use is more than enough say for serving on the home LAN multiple HD movies at the same time, provided one doesn't load it with other more heavy services. Mine is currently only serving files through NFS and SMB and bittorrent; all else is turned off.

https://ibb.co/BNMWFXm

BTW, currently working around a bigger one that will run on a faster mini PC and a 8 bay USB3.1 enclosure. I'm a little wary about USB connectors in this context, and I'll likely have to secure cables firmly to avoid wearing and accidental pulls, but so far results on the bench are promising.

e12e · 2 years ago

IIRCC it's not recommend to have swap on a zvol. So you would need a non-zfs managed partition/device for swap.

Not sure about current status?

https://github.com/openzfs/zfs/issues/7734

https://github.com/openzfs/zfs/issues/342

https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSForSwapM...

usefulcat · 2 years ago

He already has a separate NVMe drive for the OS, he could put the swap on that.

Swap on a zvol I understand, disabling it entirely seems questionable, no?

ciupicri · 2 years ago

> My main motivation for using ZFS instead of ext4 is that ZFS does data checksumming, whereas ext4 only checksums metadata and the journal, but not data at rest.

There's also dm-integrity [1]:

> The dm-integrity target emulates a block device that has additional per-sector tags that can be used for storing integrity information.

One more thing: where is the ECC RAM?

[1]: https://docs.kernel.org/admin-guide/device-mapper/dm-integri...