Quite beefy hardware for on-prem. Could someone explain to me why 30k users, even assuming they were all concurrent, would be an issue for hardware that size?
Is the app stack naturally resource-heavy, or is this setup particularly different from how an instance should be run?
1. Those 30k users are following users on other servers, which pulls in the content of those users. I'm on hachyderm, but I would guess only about 20% of the people I follow are. That means my user alone is pulling roughly 250 other users' data into the system. Of course most, if not all, of the people I follow are also followed by someone else, so it's not a pure multiplier. But it does mean a lot more data moving in.
2. NFS, which is where they had the problems, was being used as a media store. People on Mastodon and Twitter like sharing images and other media. Even people who run single-user nodes but follow a lot of people end up using a ton of storage space. 30k people scrolling through timelines and actively pulling that data out, while queues are pushing data in, can be tough to scale. Switching to an object store really helped fix that.
On top of that, the Mastodon app is very, very Sidekiq-heavy. For those not familiar with Ruby, Sidekiq is basically a background job queue system (similar to Python's Celery). You scale up by running more queue workers. The problem with NFS is that all of those workers share the filesystem, making the filesystem a single point of failure and a scaling bottleneck. Adding more queue workers makes the problem worse by adding to the filesystem load, rather than resolving it. Switching to an object store helps until the next centralized service (in this case Postgres) reaches its limits.
So basically the 30k users each following their own set of users creates a multiplier on how many users the instance is actually working with. The more users on either side of the equation the more work that needs to be done. If this was a 30k user forum where every user existed on the instance the load would be significantly less.
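The queue-worker dynamic described above can be sketched in a few lines of plain Ruby (a toy model, not Mastodon's actual code; a temp directory stands in for the shared NFS media store): every worker you add drains jobs faster, but all of them funnel writes into the same shared store.

```ruby
require "tmpdir"

# Toy model: each Sidekiq-style worker pulls a job off a shared queue
# and writes its result to the same shared store. Adding workers
# multiplies load on that one store rather than spreading it out.
jobs = Queue.new
32.times { |i| jobs << i }
4.times { jobs << :stop }            # one sentinel per worker

shared_store = Dir.mktmpdir("media") # stand-in for the NFS mount

workers = 4.times.map do
  Thread.new do
    while (job = jobs.pop) != :stop
      # every worker, no matter how many we add, writes here
      File.write(File.join(shared_store, "media-#{job}.bin"), "x" * 1024)
    end
  end
end
workers.each(&:join)

puts Dir.children(shared_store).length # => 32
```

Swapping `shared_store` for an object store turns each write into an independent HTTP request instead of contention on one filesystem, which is the fix described above.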
> The problem with NFS is that all of those queues are sharing the filesystem, making the filesystem a point of failure and scaling pain.
It is not NFS that is a SPOF; it is a single NFS server that is a SPOF. There exist distributed NFS systems (OneFS, Panasas) that can tolerate the loss of up to N servers before the service gets disrupted.
Ruby On Rails...
On top of that, the federation part is basically ingestion from all the federated servers across the world, so PostgreSQL sees constant writes even when the instance's own users aren't posting anything.
I'm absolutely amazed that people are going with federation in this form, with its massive overhead, instead of something like RSS beefed up into a standard, where you just point your client at multiple web servers, have them use your server as an identity provider, and that's it.
Imagine your subscriptions and account living on one server, then when you log in that server gives you the list and your client goes and gets all the data.
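A minimal Ruby sketch of that model (all names and feed data here are made up for illustration; in reality `fetch_feed` would be an HTTP request for some standardized feed format): the identity server only hands the client its subscription list, and the client assembles the timeline itself.

```ruby
require "time"

# What the identity server would return on login (hypothetical URLs).
def subscriptions_for(_user)
  ["https://alice.example/feed", "https://bob.example/feed"]
end

# Stand-in for fetching and parsing a standardized feed over HTTP.
def fetch_feed(url)
  {
    "https://alice.example/feed" => [{author: "alice", at: Time.utc(2023, 1, 2), text: "hi"}],
    "https://bob.example/feed"   => [{author: "bob",   at: Time.utc(2023, 1, 3), text: "yo"},
                                     {author: "bob",   at: Time.utc(2023, 1, 1), text: "old"}],
  }.fetch(url)
end

# The client merges all feeds into one timeline, newest first.
def timeline(user)
  subscriptions_for(user)
    .flat_map { |url| fetch_feed(url) }
    .sort_by { |post| post[:at] }
    .reverse
end

timeline("me").each { |p| puts "#{p[:at].strftime("%F")} #{p[:author]}: #{p[:text]}" }
```

The point of the sketch: the server side stays dumb (a list and some static feeds), and all the merge work that Mastodon does with Sidekiq queues happens in the client.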
We already had this sort of federation figured out: it's the open web. We just have to find a way to get the open web to provide the things that Google, Facebook, and Reddit provided:
Easy way to contribute content.
Discovery for new content.
Search ability.
Kill off the centralized websites by providing the things they provided, let people host websites within that system, and let the clients handle the fact that there are all sorts of providers out there.
I hate to be that person, but I've seen complicated dynamic applications push much higher bandwidth and serve millions of concurrent users with similar, if not smaller, hardware requirements.
It would be interesting to see Twitter's complete backend, and while Mastodon might not be an apples-to-apples comparison, I'd also be interested in a cost-per-user infrastructure analysis.
Not sure why Ruby on Rails is taking a beating in the comments section here. The problem is clearly the 1Gbps network that is functioning at only 200Mbps and worn out/defective SSDs. Waiting around on IO all day will bring any stack to a crawl.
There are a number of people here who'll complain about Rails whenever a Rails app is brought up, whether or not it has anything to do with the actual problem.
This is a superb write-up of an intense, exhausting situation. Great mixture of low-level detail and tactics, and high-level thinking about systems and people. Congratulations on managing that migration, and thank you for sharing this with us!
Confused as to why they didn't just replace the bad SSDs with good ones?
Fwiw this sounds to me like what happens when you use "retail" SSDs (drives marketed for use in user laptops) underneath a high write traffic application such as a relational database. Often such drives will either wear out or will turn out to have pathological performance characteristics (they do something akin to GC eventually), or they just have firmware bugs. Use enterprise rated drives for an application like this.
Hi, I made the decision not to replace the drives. I also wrote the article, and am the admin of Hachyderm.
So to be clear, we did try to "offline" a drive from the ZFS pool just to see if this was a viable path. The ZFS pool was set up a few years ago and has gone through a few iterations of disks. The mirrors were unbalanced: we had pairs of drives of one manufacturer/speed mirrored with pairs of drives from another manufacturer/speed. We know this configuration was wrong; again, we didn't intend for our little home lab to turn into a small production service.
I think after spending a few hours trying to "offline" the disk, and then repairing the already brittle ZFS configuration to get the database/media store back to a "really broken and slow but still technically working" state, we just decided to pull the plug and move to Hetzner. Offlining the disk caused even more cascading failures and took about 30 minutes just for the software. We could technically have shut down production to try it without the database running on the pool, but at that point we decided to just get out of the basement.
If it had been as easy as popping a disk in/out of the R630 (like one would imagine), we would certainly have done that.
To be honest I am still very interested in performing more analysis on ZFS on a 6.0.8 Linux kernel. I am not convinced ZFS didn't have more to do with our problems than we think. I will likely do a follow up article on benchmarking the old disks with and without ZFS in the future.
> We had pairs of drives of one manufacturer/speed mirrored with pairs of drives from another manufacturer/speed.
The different speed is an issue, but I always recommend mixing pairs so that you don't end up like me, when all the spinning metal in the same RAID-5 array failed within a short period. Wasn't a great day.
Throw ZFS away: put in X drives, make RAID10+LVM with X-1 of them (Linux supports odd numbers of drives in RAID10), and never think about it again. It's simple to set up, simple to debug, and you don't need a ZFS expert for something as simple as a disk replacement. In cases like this one there is the --write-mostly option, which tells Linux RAID to prefer the other disks for reads, so you can see whether unloading a drive changes anything. Maybe RAID6 if you're not screaming for performance but want some more space.
Focus your efforts on making robust backups instead. You don't want to be the only guy in the org who knows how to do ZFS things when it breaks.
We're running a few racks of servers; ZFS is delegated to the big boxes of spinning rust where its benefits (deduplication/compression) are put to good use, but on a bunch of SSDs it is just overkill.
Then you will have the same problems, but now you can bother the manufacturer about them!
Also, unless there is something horribly wrong with how often data is written, those SSDs should run for ages.
We ran (as a test) consumer SSDs in a busy ES cluster and they still lasted about two years just fine.
The whole setup was a bit overcomplicated too. RAID10 with 5+1 or 7+1 (yes, Linux can do 7-drive RAID10) plus a hot spare would've been entirely fine, easier, and most likely faster. You need backups anyway, so ZFS doesn't give you much here, just extra CPU usage.
Either way, monitor wait time per drive; an easy way is to just plug collectd [1] into your monitoring stack, it is light and can monitor a ton of different metrics.
* [1] https://collectd.org/
Remember, this isn't a company: it's hobbyists/enthusiasts putting their own resources into something, or running on donations when available. There's no venture capital to absorb operating losses here. Remember the old "storm/norm/conform/perform" analogy. We are still very much pushing along into norm territory, and articles like this will help establish a conform phase ... but it will take time.
Hetzner is great, but it may not be the best choice for a social network that hosts user content and may attract controversy.
As a mass-market hosting provider, Hetzner is subject to constant fraud, abuse and hacked customer servers, and in consequence, their abuse department is very trigger-happy and will usually shoot first and ask questions later. They can and will kick out customers that cause too much of a headache, regardless of their ToS.
Their outbound DDoS detection systems are very sensitive and prone to false positives, such as when you get attacked yourself and the TCP backscatter is considered a portscan. If the system is sufficiently confident that you are doing something bad, it automatically firewalls off your servers until you explain yourself.
Likewise, inbound abuse reports sometimes lead to automated or manual blocks before you can respond to them.
They have also rate limited or blocked entire port ranges in the past to get rid of Chia miners and similar miscreants, with no regard for collateral damage to other services and without informing their other customers.
Their pricing is good and service is otherwise excellent, and if you do get locked out, you can talk to actual humans to sort it out. But, only after the damage is already done. If you use them, have a backup plan.
As someone who scaled Ruby on Rails in its prime era, 2007-2009, I'll tell you the problems have not changed. It's very straightforward horizontal scaling followed by load balancing across multiple nodes. Load comes down to having enough cores, fast enough disks, and enough egress bandwidth. Everything else is purely caching in front of a poorly performing Ruby web server and minimising disk or database reads.
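That caching point can be illustrated with a toy TTL cache in Ruby (a sketch of the idea, not any particular production setup): put it in front of the slow path, a heavy render or query, and most requests never reach the app server at all.

```ruby
# A minimal TTL cache of the kind described above. In production this
# role is usually played by a CDN, Varnish, or Rails fragment caching.
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}                       # key => [value, expires_at]
  end

  def fetch(key)
    value, expires_at = @store[key]
    return value if expires_at && Time.now < expires_at
    value = yield                     # the slow path (render, query, ...)
    @store[key] = [value, Time.now + @ttl]
    value
  end
end

cache = TtlCache.new(60)
renders = 0
3.times { cache.fetch("/timeline") { renders += 1; "<html>...</html>" } }
puts renders # => 1: only the first request hit the slow path
```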
The write up is cool. Reminiscent of things we used to do back in that early rails 2-3 era. Just funny we're back where we started.
TL;DR: if you want to run Ruby on Rails on bare metal, be ready to run something with 8+ cores, 10k RPM disks minimum, and more bandwidth than you can support out of your basement.
Also, serving images from shared NFS mounts was not the best of ideas. I remember when S3 came out and the Rails attachment plugin ecosystem quickly added support; it was a godsend.
Even some hacked-up setup with two-way sync is better. The dumbest one I saw (though still somehow the least of the problems in that architecture) was running syncthing [1] on a few servers, just syncing everything, and hoping conflicts wouldn't happen...
* [1] https://syncthing.net/
Like, I did it, but I wouldn't recommend it: restarting an NFS server is gnarly, HA support on the OSS side is... not really there last time I checked, and it's an overall PITA.
Weak technology stack, and a deeply flawed concept of federation that enables local centralization of control, discord-mods-meme style, with all the corresponding issues.
Mastodon should have been based on a DHT, with each "terminal" aka "profile" having much higher autonomy.
Otherwise, it just gives more tools to people who left Twitter to continue doing the same societal damage.
p.s.: it is time to stop writing back-ends in Ruby when every other popular alternative (sans Python-based ones) is more powerful and scalable.
IMHO decentralization is not the way to go, which is why I started OpenDolphin [1].
[1]: https://about.opendolphin.social/
If you end up having a "decentralized system" with 30k users per instance, you basically just have a centralized system that federates with other instances. Sure, it is kind of decentralized, but the admins of that 30k instance are effectively able to read the DMs, impersonate users and delete their content.
I personally think (and I'm trying to formalize my ideas with OpenDolphin) that a centralized instance that is only used to serve __signed__ / encrypted content solves some of the decentralization issues we're seeing here, whilst still giving users some of the features of decentralized platforms.
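As a rough illustration of the signed-content idea (this is not OpenDolphin's actual design or API, just a sketch using Ruby's OpenSSL bindings, which need an Ed25519-capable OpenSSL): the instance can store and relay a post plus its signature, but it cannot forge or silently alter either, because any client with the author's public key can verify them.

```ruby
require "openssl"

# The author signs their post client-side; the server never sees the
# private key.
author_key = OpenSSL::PKey.generate_key("ED25519")

post = "hello fediverse"
signature = author_key.sign(nil, post)   # Ed25519 signs the raw message

# The instance stores/serves {post, signature}. Any client holding the
# author's public key can check authenticity and detect tampering:
pub = OpenSSL::PKey.read(author_key.public_to_pem)
puts pub.verify(nil, signature, post)              # => true
puts pub.verify(nil, signature, "tampered post")   # => false
```

Encryption of DMs would need more machinery on top (key exchange between participants), but signing alone already removes the impersonation and silent-deletion powers of the 30k-instance admin.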
If you like / dislike the idea, help us out! We're trying to build a community to create something great together. Every contribution counts (:
And btw, yes, I do agree: that hardware for 30k users doesn't make any sense; it really shows that something isn't optimized :(
A centralised instance of any kind is a non-starter for a lot of those of us who have moved to Mastodon. And, yes, the admins of that 30k instance are able to do all kinds of things, and their users are able to leave if they do. I'd be all for improvements around signing and encryption, but not at the cost of centralisation (for my part, I run my own Mastodon server, but I'm also tinkering with my own ActivityPub implementation).
> Sure, it is kind of decentralized, but the admins of that 30k instance are effectively able to read the DMs, impersonate users and delete their content.
There's a working group on end-to-end encryption already, and I do believe they will solve that problem.
Archiving content is trivial and can be automated. It's also pretty easy to migrate between instances: I started on one run by a friend but which I felt was too small, moved to one of the biggest ones, and then ended up on hachyderm. I may end up moving again, as I feel like the service is getting rather big; it's one of the largest instances now, and there are benefits to being on smaller instances that tend to push people towards them.
I think it's neat to start a Twitter competitor and build it in the open, but I don't understand how you're going to get traction without federation or some kind of undeniable USP. Even Twitter had to "federate" with SMS at first.
I think just having a public key posted in a DNS record, plus subscribers saving both the "human name" and the public key, would be enough; even if someone takes over a domain, every subscriber will get an alert that the key has changed. Maybe have a backup method of just storing `/.well-known/<social-network>/username` with the same info for those that don't/can't fuck with DNS.
Building on that, the list of addresses+pubkeys can be put somewhere searchable (a DHT on server nodes?) so if someone moves shop they can still be found. Then the client could either subscribe directly to whoever they want (akin to RSS), or get on someone's instance (which would be akin to an RSS aggregator) to participate in their community.
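The key-change alert in that scheme is essentially trust-on-first-use pinning. Here's a toy Ruby sketch (illustrative only; a real client would pin whatever fingerprint it fetched from DNS or `/.well-known/`):

```ruby
# Trust-on-first-use: remember the key first seen for each address and
# alert if it ever changes (as it would after a domain takeover).
class KeyPinStore
  def initialize
    @pins = {}  # "user@domain" => public key fingerprint
  end

  # Returns :first_seen, :ok, or :key_changed.
  def check(address, fingerprint)
    case @pins[address]
    when nil         then @pins[address] = fingerprint; :first_seen
    when fingerprint then :ok
    else                  :key_changed   # alert the subscriber!
    end
  end
end

pins = KeyPinStore.new
p pins.check("alice@alice.example", "ab12")  # => :first_seen
p pins.check("alice@alice.example", "ab12")  # => :ok
p pins.check("alice@alice.example", "ff99")  # => :key_changed
```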
Disambiguating DHT in case it's helpful for others, even though I'm still unclear how this would be better than what ActivityPub does now: https://en.wikipedia.org/wiki/Mainline_DHT
How would a DHT be any better than AP? It's not even close to doing the same thing, very few people would want something that comes with all the massive disadvantages that a DHT would bring.
Hi. I wrote the post. Additionally, I am responsible for operating Hachyderm (Ruby on Rails) and GitHub (Ruby on Rails) in both my free time and my day job.
I can say with certainty that Ruby specifically was not the bottleneck in our case. I do think that the rails paradigm can often lead to interdependent systems. We see this at GitHub and we also see this in Mastodon. Service A will do reads/writes against the same tables in the database that Service B also does. When service A is moved to an isolation zone, it can still impact Service B's performance.
In other words, I think any stateful framework with the flexibility of Ruby on Rails encourages bad behavior that can contribute to a noisy-neighbor problem.
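A toy illustration of that noisy-neighbor effect, with a Ruby mutex standing in for contention on a shared database table: two nominally independent services, but one table between them, so isolating Service A does not stop it from delaying Service B.

```ruby
# Threads simulate two services; the mutex simulates locks/IO pressure
# on the one table both services read and write.
table_lock = Mutex.new
log = Queue.new

service_a = Thread.new do
  table_lock.synchronize do
    log << "A: begin bulk write"
    sleep 0.2                        # pretend this is a heavy write job
    log << "A: commit"
  end
end

sleep 0.05                           # let A grab the lock first
service_b = Thread.new do
  log << "B: read requested"
  table_lock.synchronize { log << "B: read served" }  # must wait for A
end

[service_a, service_b].each(&:join)
events = []
events << log.pop until log.empty?
puts events
# B's quick, unrelated read could not complete until A's bulk write
# committed, even though B is a "separate service".
```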
The point I am trying to drive home is that I agree. I can confidently say that Ruby on Rails is not the culprit in our case. To be honest I just ignore anyone who is quick to point fingers and assign blame either technically or personally.
Sorry hacker news got you down. If it helps my family and I are making Sunday morning pancakes with my puppy Björn today and we are all wishing you the best day ever.
> It is not NFS that is a SPOF, it is a single NFS server that is a SPOF.
The main problem was slow IO because of faulty disks, which brought everything to a crawl.
> ...also interested cost per user to infrastructure analysis too.
I dread to think how much a busy 30k user instance does.
For the curious, the versions involved: zfs-2.1.4-1, zfs-kmod-2.1.6-1, kernel 6.0.8-arch1-1.
> ...all spinning metal of the same RAID-5 array failed in a short period. Wasn't a great day.
Luckily, I had a contingency plan.
Probably because they wanted to migrate to Hetzner anyway and took the chance to do it now instead of later.
But I do agree it would probably have been a better idea.
Tip: On https://about.opendolphin.social/about, the word "Retribution" is probably not the word you want here. "Compensation", maybe?
Post: we hit scaling issues caused by our failing disks and by running image hosting and databases over NFS.
HN: it's obviously Ruby on Rails' fault.