Some of our metrics came in 5 minutes delayed, which wasn't a problem on normal days. These metrics moved slowly enough that when you got an alarm, there was still plenty of time to take corrective action.
But for HVEs this was an issue. During Black Friday or Prime Day, some metrics sometimes spiked so fast that you had no time to respond (usually from people hitting page reload a few minutes before a sale kicked off).
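To make the timing concrete, here's a rough back-of-the-envelope sketch. The 5-minute ingestion delay is the real figure from above; the evaluation window and spike ramp are illustrative assumptions, not our actual numbers:

```python
# Back-of-the-envelope: can an alarm on a delayed metric beat a fast spike?
# The ingestion delay is the real figure; everything else is an assumption.

INGESTION_DELAY_MIN = 5   # metric shows up this many minutes after the fact
EVAL_PERIODS_MIN = 3      # assumed: consecutive 1-minute breaches before alarming
SPIKE_RAMP_MIN = 2        # assumed: a reload storm peaks within ~2 minutes

time_to_alarm = INGESTION_DELAY_MIN + EVAL_PERIODS_MIN  # minutes after spike start

if time_to_alarm > SPIKE_RAMP_MIN:
    print(f"Alarm fires {time_to_alarm} min in; the spike peaked at minute {SPIKE_RAMP_MIN}.")
    print("You're reacting to an outage that has already happened.")
```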
To get an idea of what was going on, I would go on Twitter and search for things like "amazon failure" or "amazon 502."
We often got problem reports via Twitter before they showed up on our dashboards.
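If you wanted to automate that search rather than doing it by hand (we didn't; this is just a sketch), polling Twitter's v2 recent-search endpoint would look something like the following. The bearer token and the polling setup are assumptions for illustration:

```python
# Sketch: poll Twitter's v2 recent-search API for outage chatter.
# Hypothetical automation; in practice we just searched in the browser.
import os
import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # assumed credential
QUERIES = ['"amazon failure"', '"amazon 502"']

def recent_tweets(query: str) -> list[dict]:
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"query": query, "max_results": 10},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

for q in QUERIES:
    for tweet in recent_tweets(q):
        print(q, "->", tweet["text"][:80])
```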
SRO is madness. SRO is admitting that you're far enough underwater, and that there are enough unexpected critical events, to dedicate a team to burning out on them. There was a lot of spend on overcapacity and on failover-automation testing that kept things running, and explicit mandates from the C-suite and VPs to keep it that way. More importantly, though, it was baked into the architecture.
Everyone's talking this week about how Twitter will go up in flames in the next few days; I don't think so. In that respect, I think they're closer to Google: there's a lot of automation that will keep things alive and a lot of redundancy that will absorb the immediate problems of having much of the team leave. I actually think the company can continue to run indefinitely if they just follow the operations manuals and replace failed drives and machines. But I think the first major outage will come when they try to launch a new feature and it fails.