Aside from running Ceph as my day job, I have a 9-node Ceph cluster on Raspberry Pi 4s at home that I've been running for a year now, and I'm slowly starting to move things away from ZFS to this cluster as my main storage.
My setup is individual nodes, with 2.5" external HDDs (mostly SMR), so I actually get slightly better performance than this cluster, and I'm using 4+2 erasure coding for the main data pool for CephFS.
CephFS has so far been incredibly stable and all my Linux laptops reconnect to it after sleep with no issues (in this regard it's better than NFS).
I like this setup a lot better than ZFS now, I'm slowly migrating away from ZFS, and I'm even thinking of setting up a second Ceph cluster. The best thing with Ceph is that I can do maintenance on a node at any time and storage availability is never affected; with ZFS I've always dreaded any kind of upgrade, and any reboot requires an outage. Plus, with Ceph I can add just one disk at a time to the cluster, and disks don't have to be the same size. Also, I can now move the physical nodes individually to a different part of my home, or change switches and network cabling, without an outage. It's a nice feeling.
I want to preface this: I don't already have a strong opinion here, and I'm curious about Ceph. As someone who runs a 6-drive raidz2 at home (w/ ECC RAM), does your Ceph config give you similar data integrity guarantees to ZFS? If so, what are the key points of the config that enable that?
When Ceph migrated from FileStore to BlueStore, that enabled scrubbing and checksumming of the data itself (older, pre-BlueStore versions only verified metadata).
Ceph (by default) does metadata scrubs every 24 hours, and data scrubs (deep-scrub) weekly (configurable, and you can manually scrub individual PGs at any time if that's your thing). I believe the default checksum used is "crc32c", and it's configurable, but I've not played with changing it. At work we get scrub errors on average maybe weekly now, at home I've not had a scrub error yet on this cluster in the past year (I did have a drive that failed and still needs to be replaced).
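To make the checksum part concrete: that crc32c is CRC-32C (Castagnoli). Here is a toy, bit-by-bit Python version of the same polynomial - BlueStore obviously uses an optimized implementation, this is only to show what "checksum on write, verify on deep-scrub" means:

    # Toy CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    # Real implementations are table-driven or use the SSE4.2/ARMv8 crc32c instruction.
    def crc32c(data: bytes, crc: int = 0) -> int:
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    # Standard check value for the CRC-32C polynomial.
    assert crc32c(b"123456789") == 0xE3069283

    # The deep-scrub idea: recompute the checksum of stored data and compare it
    # with the checksum that was recorded when the data was written.
    block = b"a chunk of object data"
    recorded = crc32c(block)           # stored alongside the data at write time
    assert crc32c(block) == recorded   # a mismatch here would surface as a scrub error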
My RPi setup certainly does not have ECC RAM as far as I'm aware, but neither does my current ZFS setup (also a 6 drive RAIDZ2).
Nothing stopping you from running Ceph on boxes with ECC RAM; we certainly do that at my job.
I was running glusterfs on an array of ODROID-HC2s ( https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/ ) and it was fun, but I've since migrated back to just a single big honking box (specifically a threadripper 1920x running unraid). Monitoring & maintaining an array of systems was its own IT job that kinda didn't seem worth dealing with.
My single ZFS box does that kind of throughput with ease (3x mirrored vdevs = 6 disks total), but I'm curious, as the flexibility of Ceph sounds tempting.
I just set up a test cluster at work to test this for you:
4 nodes, each node with 2x SAS SSDs, dual 25Gb NICs (one for front-end, one for back-end replication). The test pool is 3x replicated with Snappy compression enabled.
On a separate client (also with 25Gb) I mounted an RBD image with krbd and ran FIO: I get a consistent 1.4 GiB/s.
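If anyone wants to reproduce something similar, this is roughly the shape of it, sketched in Python around fio. The device path and job parameters are just examples, and the JSON field names are from fio 3.x, so double-check them on older versions:

    import json
    import subprocess

    # Assumes the RBD image is already mapped with krbd, e.g. at /dev/rbd0.
    DEVICE = "/dev/rbd0"

    # Large sequential reads with direct I/O and a deep queue - a streaming
    # throughput job rather than an IOPS job.
    cmd = [
        "fio", "--name=seqread", "--filename=" + DEVICE,
        "--rw=read", "--bs=4M", "--direct=1", "--ioengine=libaio",
        "--iodepth=32", "--runtime=60", "--time_based",
        "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    job = json.loads(out)["jobs"][0]
    print("sequential read: %.2f GiB/s" % (job["read"]["bw_bytes"] / 2**30))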
For the standard 3x replicated setup, 3 nodes is the minimum for any kind of practical redundancy, but you really want 4, so that after the failure of 1 node all the data can be recovered onto the other 3 and you still have failure resiliency.
For erasure-coded setups (not really suited to block storage, but mainly object storage via radosgw (S3) or CephFS), you need a minimum of k+m nodes and realistically k+m+1. That translates to 6 minimum, but realistically 7 nodes, for k=4, m=2. That's 4 data chunks and 2 redundant chunks, which means you use 1.5x the storage of the raw data (half that of a 3x replicated setup). You can also do k=2, m=1, so 4 nodes in that case.
I would say the minimum is whatever your biggest replication or erasure coding config is, plus 1. So, with just replicated setups, that's 4 nodes, and with EC 4+2, that's 7 nodes. With EC 8+3, which is pretty common for object storage workloads, that's 12 nodes.
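A quick back-of-the-envelope helper for the numbers above (plain arithmetic, nothing Ceph-specific):

    # Raw-capacity overhead and node counts for replicated vs erasure-coded pools.
    def replicated(copies):
        return {"raw_per_usable": copies, "min_nodes": copies, "comfortable": copies + 1}

    def erasure(k, m):
        return {"raw_per_usable": (k + m) / k, "min_nodes": k + m, "comfortable": k + m + 1}

    print("3x replicated:", replicated(3))  # 3.00x raw, 3 nodes min, 4 comfortable
    print("EC 2+1:", erasure(2, 1))         # 1.50x raw, 3 nodes min, 4 comfortable
    print("EC 4+2:", erasure(4, 2))         # 1.50x raw, 6 nodes min, 7 comfortable
    print("EC 8+3:", erasure(8, 3))         # ~1.38x raw, 11 nodes min, 12 comfortable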
Note, a "node" or a failure domain, can be configured as a disk, an actual node (default), a TOR switch, a rack, a row, or even a datacenter. Ceph will spread the replicas across those failure domains for you.
At work, our bigger clusters can withstand a rack going down. Also, the more nodes you have, the less of an impact it is on the cluster when a node goes down, and the faster the recovery.
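For anyone curious what that looks like in practice: the failure domain is part of the CRUSH rule (or the erasure-code profile) that a pool is created with. A rough sketch - the pool and profile names are made up, and the exact commands are worth double-checking against the docs for your release:

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; assumes admin credentials are in place.
        subprocess.run(["ceph", *args], check=True)

    # Replicated pool whose 3 copies land in 3 different racks instead of 3 hosts.
    ceph("osd", "crush", "rule", "create-replicated", "rep_rack", "default", "rack")
    ceph("osd", "pool", "create", "mypool", "replicated", "rep_rack")
    ceph("osd", "pool", "set", "mypool", "size", "3")

    # 4+2 erasure-coded pool spread across hosts (host is the default failure domain).
    ceph("osd", "erasure-code-profile", "set", "ec42", "k=4", "m=2",
         "crush-failure-domain=host")
    ceph("osd", "pool", "create", "ecpool", "erasure", "ec42")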
I started with 3 RPis then quickly expanded to 6, and the only reason I have 9 nodes now is because that's all I could find.
I would love to hear more about your Ceph setup. Specifically, how are you connecting your drives, and how many drives per node? I imagine that with the Pi's limited USB bus bandwidth, your cluster performs more as an archival data store than as real-time read/write storage like the backing block storage for VMs. I have been wanting to build a Ceph test cluster, and it sounds like this type of setup might do the trick.
Each node is completely separate, housed in a good quality aluminum enclosure with a fan, and sitting on top of an external USB Seagate 2.5" portable drive (either 4TB or 5TB), connected via USB 3 cable. I'm pretty sure these drives are SMR, but they've been good to me, and they're fast enough for my needs.
Power is provided either using official RPi power supplies, or a couple of multi-port Anker USB power supplies that I had previously. A limit of 2.5 amps does not seem to cause any issues.
Currently everything is connected to a single switch, but I move things around my office sometimes, and sometimes have the RPis connected to two different switches.
Right now, everything, including the one switch, is connected to a single APC UPS, and that thing is super old, so that's another SPOF.
My clients currently are a few wired desktops and laptops over wifi, all connecting via CephFS. I haven't tested with librbd or krbd, I imagine it wouldn't be fast.
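For anyone wondering what mounting CephFS on a client looks like, it's roughly this (kernel client, older-style syntax; the monitor address, user name, and secret path are placeholders):

    import subprocess

    # Kernel CephFS mount against one of the monitors.
    subprocess.run([
        "mount", "-t", "ceph", "192.168.1.11:6789:/", "/mnt/cephfs",
        "-o", "name=admin,secretfile=/etc/ceph/admin.secret",
    ], check=True)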
The RPis are mostly 8GB, but I do have a couple 4GB, and one RPi 400, which is kind of hilarious.
Everything is running Ubuntu 20.04, Ceph Pacific, and deployed from the first node with cephadm.
I use only Samsung microSD cards, either 32GB or 64GB. I don't think it matters what kind, but getting bigger cards makes me feel like they'll last longer. Most of the nodes have /var on the external drive (on a small partition at the beginning of the drive), but I do have a few where I didn't set that up early on, and haven't gotten around to redoing it.
I partition the drives and set up LVM manually, and tell cephadm to use the specific LV instead of the bare drive.
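Heavily simplified, the per-node provisioning looks something like the sketch below. The device names, VG/LV names, hostname, and IP are placeholders, and the cephadm/orchestrator invocations are worth re-checking against the docs for your release:

    import subprocess

    def sh(*args):
        subprocess.run(list(args), check=True)

    # One-time, on the first node only: bootstrap the cluster with cephadm.
    # sh("cephadm", "bootstrap", "--mon-ip", "192.168.1.11")

    # On each node, the external drive already has a small /var partition and a
    # big data partition; put LVM on the big one by hand.
    sh("pvcreate", "/dev/sda2")
    sh("vgcreate", "ceph-hdd", "/dev/sda2")
    sh("lvcreate", "-l", "100%FREE", "-n", "osd-data", "ceph-hdd")

    # Then, from the admin node, point the orchestrator at the specific LV
    # instead of letting it consume the bare drive.
    sh("ceph", "orch", "daemon", "add", "osd", "pi-node-1:/dev/ceph-hdd/osd-data")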
If you want any kind of performance, definitely set your expectations very low, but for me this works. I can stream at least a pair of 4K movies off this simultaneously, and I also run an instance of Paperless-NG off this over a CephFS mount and haven't had any issues.
I tried using Ceph twice at home. Once was via Proxmox, and it installed and ran perfectly fine, although tbf I didn't load it with much.
The next was via Rook, since I have a Kubernetes cluster, and it was a nightmare. I spent a week or so reading through all the docs I could find before I felt prepared to go through with it, only to have random clock sync issues that Reddit informed me were due to me enabling power savings mode in the BIOS for my nodes.
ZFS's biggest hiccup for me is when I do a kernel update and DKMS borks the upgrade. Other than that, it's been rock-solid. I run a normal and backup node with it, no regrets.
I solved the ZFS DKMS bork issue by moving from CentOS 8 to Debian 11. I've had zero OpenZFS issues since the move. On CentOS it would require work every time a sufficiently large kernel upgrade came in.
Since I'm familiar with RHEL I just swapped some of the Debian default services for RHELish alternatives (Firewalld, Podman, etc.).
> Plus with Ceph I can add just one disk at a time to the cluster and disks don't have to be the same size.
I'd like to note that ZFS now has RAID-Z expansion which allows us to do exactly that! It's an essential feature for home users since it allows us to gradually expand capacity instead of buying up all the storage up front at great cost.
I too researched ceph for this exact reason but was told the hardware requirements were too high for a typical home lab, yet you're running ceph on raspberry pis... I should probably look into ceph once more.
I'm also running Ceph (using the Rook Kubernetes operator) in my homelab. I've been running this setup for 9 months now with 2 cheap HP EliteDesk workstations I picked up on eBay, with 2x 8TB HDDs in each.
Since this setup has run incredibly smoothly so far, I plan on using SolidRun's HoneyComb LX2 as a Ceph node with bigger disks and an NVMe write cache in the future. I looked at the Raspberry Pi 4, but was not too impressed by the single PCIe lane, since I also plan on using NVMe disks as Ceph's metadata storage device (to speed up the hard disk holding the bulk data), and Ceph recommends 10GbE NICs.
The HoneyComb LX2 has 4 built-in 10GbE ports, 16 A72 cores, actual DDR4 RAM slots, a 4-lane PCIe 3.0 M.2 slot, and an open-ended PCIe 3.0 x8 slot (so you can put in a full x16 device) for a max of 8 GB/s of bandwidth.
Since it's an ARM box, it's incredibly energy efficient, which is important since energy prices are rising in my country. Also, it's the only affordable, performant ARM device, at 800 USD.
Man, Ceph really doesn't get enough love. For all the distributed systems hype out there - be it Kubernetes or blockchains or serverless - the ol' rock-solid distributed storage systems have sat in the background, iterating like crazy.
We had a huge Rook/Ceph installation in the early days of our startup before we killed off the product that used it (sadly). It did explode under some rare unusual cases, but I sometimes miss it! For folks who aren't aware, a rough TLDR is that Ceph is to ZFS/LVM what Kubernetes is to containers.
This seems like a very cool board for a Ceph lab - although extremely expensive - and I say that as someone who sells very expensive Raspberry Pi based computers!
Ceph is fantastic. I use it as the storage layer in my homelab. I've done some things that I can only concisely describe as super fucked up to this Ceph cluster, and every single time I've come out the other side with zero data loss, not having to restore a backup.
I think many people (myself included) had been burned by major disasters on earlier clustered storage solutions (like early Gluster installations). Ceph seems to have been under the radar for a bit by the time it got to a more stable/usable point, and came more into the limelight once people started deploying Kubernetes (and Rook, and more integrated/holistic clustered storage solutions).
So I think a big part of Ceph's success (at least IMO) was its timing, and its adoption into a more cloud-first ecosystem. That narrowed the use cases down from what the earliest networked storage software was trying to solve.
We're more and more feeling we made the wrong call with Gluster... The underlying bricks being a POSIX fs felt a lot safer at the time, but in hindsight Ceph or one of the newer ones would probably have been a better choice. So much inexplicable behavior. For your sake I hope the grass really is greener.
Can someone with experience with Ceph and MinIO or SeaweedFS comment on how they compare?
I currently run a single-node SnapRAID setup, but would like to expand to a distributed one, and would ideally prefer something simple (which is why I chose SnapRAID over ZFS). Ceph feels too enterprisey and complex for my needs, but at the same time, I wouldn't want to entrust my data to a simpler project that can have major issues I only discover years down the road.
SeaweedFS has an interesting comparison[1], but I'm not sure how biased it is.
[1]: https://github.com/seaweedfs/seaweedfs#compared-to-ceph
SeaweedFS has problems with large "pools". It's based on an old Facebook paper (Haystack) and is meant as blob storage for distributing large image caches. I found it mediocre at best, as its documentation was lacking, performance was lacking (in my tests), and the multitude of components were hard to get working.
The idea behind it is that every daemon uses one large file as its data store, to skip slow metadata access. There are different ways to access the storage through gateways.
MinIO has changed so much in the last few years that I can't give a competent answer, but compared to SeaweedFS it uses many small local databases. Right now it's deprecating many features like the gateway, and it's split into 2 main components (CLI and server); by comparison, the SeaweedFS deployment is dead simple. I don't know which direction the project is going - it went from a normal open source project to a more business-like deal (from what I saw) - but like I said, I didn't quite follow the process.
Ceph is based on block storage. It offers an object gateway (S3/Swift), a filesystem (CephFS), and block storage (RBD). You can access everything through librados directly as well. For a minimal setup you need a "larger" cluster, but it is the most flexible solution (imho). It uses the most resources as well, but you can do nearly everything you want with it, without limits.
I love it, but when it fails at scale, it can be hard to reason about. Or at least that was the case when I was using it a few years back. Still keen to try it again and see what's changed. I haven't run it since bluestore was released.
Yeah, I've been running a small Ceph cluster at home, and my only real issue with it is the relative scarcity of good conceptual documentation.
I personally learned about Ceph from a coworker and fellow distributed systems geek who's a big fan of the design. So I kind of absorbed a lot of the concepts before I ever actually started using it. There have been quite a few times where I look at a command or config parameter, and think, "oh, I know what that's probably doing under the hood"... but when I try to actually check that assumption, the documentation is missing, or sparse, or outdated, or I have to "read between the lines" of a bunch of different pages to understand what's really happening.
I've run Ceph at two Fortune 50 companies from 2013 to now, and I've not lost a single production object. We've had outages, yes, but not because of Ceph; it was always something else causing cascading issues.
Today I have a few dozen clusters with over 250 PB total storage, some on hardware with spinning rust that's over 5 years old, and I sleep very well at night. I've been doing storage for a long time, and no other system, open source or enterprise, has given me such a feeling of security in knowing my data is safe.
Any time I read about a big Ceph outage, it's always a bunch of things that should have never been allowed in production, compounded by non-existent monitoring, and poor understanding of how Ceph works.
I will once again lament the fact that WD Labs built SBCs that sat with their hard drives to make them individual Ceph nodes, but never took the hardware to production. It seems to me there's still a market for SBCs that could serve a Ceph OSD on a per-device basis, although with ever-increasing density in the storage and hyperconverged space, that's probably more of a small business or prosumer solution.
Yeah, those were really cool. I saw some homelab setups using the ODroid HC2 from Hardkernel in a similar way.
The 2 issues with this setup were that the HC2 used a low-performance ARMv7 processor (ARMv7 being a platform that most software barely supports), and that you couldn't use a flash-based disk as a Ceph BlueStore metadata device, since it only had one SATA port.
Credit where it's due - this is some 18 watt awesomeness at idle. Is it more "practical" than doing a Mini-ITX (or smaller, like one of those super small mini PCs with up to a 5900HX) build and equipping it with one or more NVMe expansion cards? Probably not. But it's cool.
Now, if only there were a new Pi to buy. Isn't it time for the 5? It's been 3 years, for most of which they've been hard to find. Mine broke and I really miss it, because having a full-blown desktop doing little things makes no sense, especially during the summer.
18 W idle is kinda horrible if you just want a small server (granted, this isn't one server, but instead six set-top boxes in one). That's recycled entry-level rack server range, and those come with an ILOM/BMC. Most old-ish fat clients can do <10 W, some <5 W, no problem. If you want a desktop that consumes little power when idle or lightly loaded, just get basically any Intel system with an IGP since 4th gen (Haswell). Avoid Ryzen CPUs with a dGPU if that's your goal; those are gas guzzlers.
1. I would bet at least half of all that wattage is the SSDs.
2. Buddy, you're spewing BS at someone who used to run a Haswell in a really small Mini-ITX case. It was a fine HTPC back in 2014. But now everything, bar my dead Pi, is some kind of Ryzen. All desktops and laptops. The various 4800u/5800u/6800u and lower parts offer tremendous performance at 15W nominal power levels. The 5800H I am writing this message on is hardly a guzzler, especially when compared to Intel's core11/12 parts.
This random drive-by intel shilling really took me by surprise.
If someone is trying to find a Pi, you can try the Telegram bot I made for rpilocator.com. It will notify you as soon as there is stock, with filters for specific Pis and your location/preferred vendor.
The bot is here: https://t.me/rpilocating_bot
Source: https://github.com/sschueller/rpilocatorbot
Would buy this in an instant if it weren't hobbled as hell by the onboard realtek switch. If it had an upstream 2.5/5/10g port it would be instantly 6 times more capable.
Would 6 Pis be able to handle more than 1G? It says that they got around 70 MB/s write and 100 MB/s read. 2.5/5/10G seems like it would be a waste unless I'm overlooking something.
The AXI bus internal to the Pi's SoC is only capable of about 4 Gbps, and it carries DMA, so ~2 Gbps is more or less the hard limit for any kind of combined IO operation like disk<=>network, no matter what kind of hardware you use for disk and network.
So yes, each Pi can easily saturate its own 1 Gbps interface, so a system like Ceph that parallelizes reads and writes among nodes is severely crippled by the onboard switch choking off bandwidth to external clients. For the same reason, you can't easily scale this platform beyond a single board, which puts your clustered system back into a single point of failure.
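To spell out the arithmetic (the ~4 Gbps AXI figure is the one quoted above, not something I've measured on this board):

    # Rough per-Pi and per-board ceilings based on the numbers in this thread.
    axi_gbps = 4.0
    combined_io_gbps = axi_gbps / 2      # disk<->network data crosses the bus twice
    combined_io_mbs = combined_io_gbps * 1000 / 8

    gbe_mbs = 1000 / 8                   # one 1 GbE port, ~125 MB/s line rate
    reported_write_mbs, reported_read_mbs = 70, 100

    print(f"per-Pi combined I/O ceiling: ~{combined_io_mbs:.0f} MB/s")
    print(f"per-Pi 1 GbE line rate:      ~{gbe_mbs:.0f} MB/s")
    print(f"reported cluster figures:     {reported_write_mbs} MB/s write, {reported_read_mbs} MB/s read")
    # So once several Pis serve reads in parallel, the single 1 GbE uplink out of
    # the onboard switch, not the Pis themselves, is what caps an external client.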
> Many people will say "just buy one PC and run VMs on it!", but to that, I say "phooey."
I mean with VM-leaking things like Spectre (not sure how much similar things affect ARM tbh) having physical barriers between your CPUs can be seen as a positive thing.
Sure, it's just that the Raspberry Pi isn't really fast enough for most production workloads. Having a cluster of them doesn't really help, you'd still be better off with a single PC.
As a learning tool, having the ability to build a real hardware cluster in a Mini-ITX case is awesome. I do sort of wonder what the business case for these boards is - I mean, are there actually enough people who want to do something like this... schools, maybe? I still think it's beyond weird that there's so much hardware available for building Pi clusters, but I can't get an ARM desktop motherboard with a PCIe slot, capable of actually being used as a desktop, for a reasonable price.
I think a lot of these types of boards are built with the business case of either "edge/IoT" (which still for some reason causes people to toss money at them since they're hot buzzwords... just need 5G too for the trifecta), or for deploying many ARM cores/discrete ARM64 computers in a space/energy-efficient manner. Some places need little ARM build farms, and that's where I've seen the most non-hobbyist interest in the CM4 blade, Turing Pi 2, and this board.
The future of cloud is Zero Isolation... With all the mitigations slowing things down, and with current energy prices high and rising, having super-small nodes that are always dedicated to one task seems interesting.
Unless you are constrained in space to a single ITX case as in this example, you can get whole x86 machines for <$100 with RAM and storage included.
There is a lot of choice in the <$150 range. You could get eight of these and a cheap 10-port switch for any kind of clustering lab you want to set up.
Here is an example: https://www.aliexpress.com/item/3256804328705784.html?spm=a2...
These are thin clients, but flip an option in the BIOS and it's a regular PC.
Adafruit had some in stock a few minutes ago: https://twitter.com/rpilocator ... I think every Wednesday around 11am ... I almost got one this time, but because they made me set up 2FA I couldn't check out in time.
Ceph is where the action is now.
https://pine64.com/product/pine-a64-lts/
What happened? I remember buying a Pi 3B+ in 2019 for less than 50€.