More likely their core database hit some scaling limit and fell over. Their status page keeps mentioning that they're working with their "upstream database provider" (presumably AWS) to find a fix.
My guess: they use AWS-hosted PostgreSQL, autovacuum fell permanently behind without them noticing and can't keep up with organic growth, and they can't scale vertically because they already maxed that out. So now they have to do crash migrations of data off their core DB, which is why it's taking so long.
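For context on the autovacuum theory: below is a minimal sketch of the kind of check that surfaces this long before it becomes an outage, assuming stock PostgreSQL and psycopg2 (the connection string is hypothetical; the catalog views are standard):

    # Hypothetical health check; DSN is made up, the views are standard PostgreSQL.
    import psycopg2

    conn = psycopg2.connect("dbname=app host=primary.db.internal")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Dead tuples piling up means autovacuum is not keeping pace with write load.
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname}: {dead} dead vs {live} live tuples, last autovacuum {last_av}")

        # Transaction ID age is the really scary version: approach the ~2 billion
        # ceiling and PostgreSQL will stop accepting writes to prevent wraparound.
        cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
        for datname, xid_age in cur.fetchall():
            print(f"{datname}: transaction ID age {xid_age}")

If n_dead_tup keeps climbing or age(datfrozenxid) trends toward two billion, autovacuum has lost the race, and you want to know long before the forced anti-wraparound vacuum decides for you.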
Feature factories falling apart always remind me of the stories doctors tell about patients showing up with symptoms that the doctor could have done something about two years ago. Like diabetes so far progressed that amputation is now on the list of possibilities.
Nobody asks for help when the help can still be productive. It's always the deathbed conversion.
An outage of this magnitude is almost ALWAYS the direct and immediate fault of senior leadership's priorities and focus: pushing too hard in some areas, not listening to engineers about needed maintenance work, etc.
And are engineers never the cause of mistakes? There can't possibly be data to back up the claim that major outages are more often caused by leadership. I've been in SIEs simply because someone pushed a bad change to a switch network and caused an outage. Statements like these only show how much we have to learn, humble ourselves, and stop blaming others all the time.
Leadership can include engineers responsible for technical priorities. If you're down for that long, though, it's usually an organizational fuck-up, because the priorities didn't include identifying and mitigating systemic failure modes. The proximate cause isn't all that important, and the people who set organizational priorities are by and large not engineers.
Think of airplane safety; I think it's similar. A good culture makes it more likely that $root-cause is detected, tested for, isolated, monitored, easy to roll back, and so on.
Hugops to the people working on this for the last 31+ hours.
Running an incident of this significance is hard, draining, and takes a lot of effort; one that goes on this long must be very difficult for everyone involved.
Prediction: Someone confidently broke something, then confidently 'fixed' it, with the consequence of breaking more things instead. And now they have either been pulled off of the cleanup work or they wish they had been.
Wow, >31h. I'm surprised they couldn't rebuild their entire system in parallel on new infra in that time. That can be hard if data loss is involved, though (a guess). Would love to see the postmortem so we can all learn.
I doubt it’s an infra failure; more likely a software failure. Their bad design has caught up with them, and for some reason they can’t just throw more hardware at it. Most companies have this https://xkcd.com/2347/ somewhere in their stack, and theirs has fallen over.
Lots get starry-eyed and aim for five nines right out of the gate where they should have been targeting nine fives and learning from that. Walk before you run.
> Change controls are tighter, and we’re investing in long-term performance improvements, especially in the CMS.
This reads as if overall performance was an afterthought, which doesn’t seem practical; it should be a business metric, since it matters to users after all.
Then again, it’s easy to comment like this in hindsight. We’ll see what happens long term.
My guess is the reason they’ve been down so long is that they don’t have a good rollback path, so they’re attempting to fix forward with limited success.
Edit: an outage of this length smells of bad systems architecture...
Four nines is not what I would be citing at this point. (That's less than an hour per year, so they've burned that budget for the next three decades; rough math below.)
Maybe aim for 99% first.
Otherwise a pretty honest and solid response, kudos for that!
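For anyone who wants the rough math on those nines (plain Python, nothing specific to this incident):

    # Downtime budget per year at various availability targets, plus what a
    # single 31-hour outage costs against a four-nines budget.
    minutes_per_year = 365.25 * 24 * 60

    for label, availability in [("99% (two nines)", 0.99),
                                ("99.9% (three nines)", 0.999),
                                ("99.99% (four nines)", 0.9999),
                                ("99.999% (five nines)", 0.99999)]:
        budget = minutes_per_year * (1 - availability)
        print(f"{label}: ~{budget:,.1f} minutes/year of allowed downtime")

    outage_minutes = 31 * 60
    four_nines_budget = minutes_per_year * (1 - 0.9999)
    print(f"A 31h outage ~ {outage_minutes / four_nines_budget:.0f} years of four-nines budget")

That works out to roughly 53 minutes per year at four nines, so a single 31-hour outage eats about 35 years of that budget.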
I always strive for 7 9s myself, just not necessarily consecutive digits.