stackskipton · 5 months ago
My SRE brain reading between the lines says they've been a feature factory and the tech debt finally caught up with them.

My guess is the reason they've been down for so long is that they don't have a good rollback path, so they're attempting to fix forward with limited success.

qcnguy · 5 months ago
More likely that their core database hit some scaling limit and fell over. Their status page talks constantly about them working with their "upstream database provider" (presumably AWS) to find a fix.

My guess: they use AWS-hosted PostgreSQL, autovacuuming fell permanently behind without them noticing and can't keep up with organic growth, and they can't scale vertically because they already maxed that out. So they have to do crash migrations of data off their core DB, which is why it's taking so long.
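
For reference, the classic way to spot autovacuum falling behind is comparing dead vs. live tuples per table in pg_stat_user_tables. A minimal sketch of that check, assuming psycopg2 and a read-only connection (the connection string is purely hypothetical, no idea what their setup actually looks like):

    # Sketch: find the tables where autovacuum has fallen furthest behind.
    import psycopg2

    # Hypothetical read-only DSN -- purely illustrative.
    conn = psycopg2.connect("host=db.example.internal dbname=app user=readonly")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for relname, live, dead, last_vacuum in cur.fetchall():
            bloat = dead / max(live, 1)
            print(f"{relname}: {dead} dead vs {live} live ({bloat:.0%}), last autovacuum: {last_vacuum}")

Large and growing dead-tuple counts paired with stale last_autovacuum timestamps would point at exactly this failure mode.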

hinkley · 5 months ago
Feature factories falling apart always remind me of the stories doctors tell about patients showing up with symptoms that the doctor could have done something about two years ago. Like diabetes so far progressed that amputation is now on the list of possibilities.

Nobody asks for help when the help can still be productive. It's always the deathbed conversion.

esafak · 5 months ago
If so, it is probably a good time to apply for an SRE position there, unless they really do not get it!
acedTrex · 5 months ago
An outage of this magnitude is almost ALWAYS the direct and immediate fault of senior leadership's priorities and focus: pushing too hard in some areas, not listening to engineers on needed maintenance tasks, etc.
AsmodiusVI · 5 months ago
And are engineers never the cause of mistakes? There can't possibly be any data to back up the claim that major outages are more often caused by leadership. I've been in SIEs simply because someone pushed a bad change to a switch network. Statements like these only go to show how much we have to learn, humble ourselves, and stop blaming others all the time.
AlotOfReading · 5 months ago
Leadership can include engineers responsible for technical priorities. If you're down for that long though, it's usually an organizational fuck-up because the priorities didn't include identifying and mitigating systemic failure modes. The proximate cause isn't all that important and the people who set organizational priorities are by-and-large not engineers.
acedTrex · 5 months ago
PROLONGED outages are a failure mode that, more often than not, requires organizational dysfunction to happen.
bravesoul2 · 5 months ago
Think of airplane safety. I think it is similar. A good culture makes it more likely that $root-cause is detected, tested for, isolated, monitored, easy to roll back, and so on.
nusl · 5 months ago
My sympathy for those in the mud dealing with this. Never a fun place to be. Hope y'all figure it out and manage to de-stress :)
willejs · 5 months ago
Hugops to the people working on this for the last 31+ hours. Running incidents of this significance is hard, draining, and requires a lot of effort; this going on for so long must be very difficult for all involved.
bravesoul2 · 5 months ago
Hopefully they are rotating teams, not having people stay awake for a dangerous amount of time.
mattbillenstein · 5 months ago
We're sorry https://www.youtube.com/watch?v=9u0EL_u4nvw

Edit: an outage of this length smells of bad systems architecture...

hinkley · 5 months ago
Prediction: Someone confidently broke something, then confidently 'fixed' it, with the consequence of breaking more things instead. And now they have either been pulled off of the cleanup work or they wish they had been.
bravesoul2 · 5 months ago
Wow, >31h. I am surprised they couldn't rebuild their entire systems in parallel on new infra in that time. Can be hard if data loss is involved tho (a guess). Would love to see the post mortem so we all can learn.
stackskipton · 5 months ago
I doubt it's an infra failure so much as a software failure. Their bad design has caught up with them and they can't just throw more hardware at it for some reason. Most companies have this https://xkcd.com/2347/ in their stack, and it's fallen over.
dangoodmanUT · 5 months ago
Hugs for their SREs sweating bullets rn
progbits · 5 months ago
> 99.99%+ uptime is the standard we need to meet, and lately, we haven’t.

Four nines is not what I would be citing at this point. (That's less than an hour of downtime per year, so they've burned the budget for the next three decades.)

Maybe aim for 99% first.
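
For anyone who wants the back-of-the-envelope numbers, a quick sketch (using the ~31 hour figure quoted elsewhere in the thread):

    # Sketch: downtime budgets per year and what a 31-hour outage costs.
    HOURS_PER_YEAR = 24 * 365
    OUTAGE_HOURS = 31  # figure quoted upthread

    def budget_hours(availability):
        """Allowed downtime per year at a given availability target."""
        return HOURS_PER_YEAR * (1 - availability)

    four_nines = budget_hours(0.9999)  # ~0.88 h/year (~53 minutes)
    two_nines = budget_hours(0.99)     # ~87.6 h/year

    print(f"99.99%: {four_nines:.2f} h/yr budget -> outage = {OUTAGE_HOURS / four_nines:.0f} years of budget")
    print(f"99.00%: {two_nines:.1f} h/yr budget -> outage = {OUTAGE_HOURS / two_nines:.2f} of one year's budget")

Roughly 53 minutes of budget per year at four nines, with ~35 years of it gone in one incident; even at 99%, this outage eats about a third of a year's budget.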

Otherwise a pretty honest and solid response, kudos for that!

Spivak · 5 months ago
I strive for one 9, thank you. No need to overcomplicate. We use Lambda on top of Glacier.
zamadatix · 5 months ago
One could have nearly 3 such incidents per year and still have hit 99%.

I always strive for 7 9s myself, just not necessarily consecutive digits.

theideaofcoffee · 5 months ago
Lots of teams get starry-eyed and aim for five nines right out of the gate when they should be targeting nine fives and learning from that. Walk before you run.


edoceo · 5 months ago
Interesting the phrase "I'm sorry" was in there. Almost feels like someone in the Big Chair taking a bit of responsibility. Cheers to that.
thih9 · 5 months ago
> Change controls are tighter, and we’re investing in long-term performance improvements, especially in the CMS.

This reads as if overall performance was an afterthought, which doesn't seem practical; it should be a business metric, since it is important to the users after all.

Then again, it’s easy to comment like this in hindsight. We’ll see what happens long term.

newZWhoDis · 5 months ago
As a former Webflow customer, I can assure you performance was always an afterthought.
stackskipton · 5 months ago
I mean, if customers don't leave them over this, the higher-ups likely won't care after the dust settles.
bravesoul2 · 5 months ago
Decent update. Guess people are really waiting for a fix tho!