> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag signaling that quota policy changes must use the standard gradual rollout tooling.
The baseline can’t be “the trend isn’t worsening”; the baseline should be that if global config rollouts are commonly the cause of problems, the bar for letting a config system bypass best practices keeps getting raised. Clearly that didn’t happen here.
I hope the pendulum now swings the other way in the discussion.
[disclaimer: I worked as a GCP SRE for a long time, but left a while ago]
Wow, I didn't know they were basically an MSFT subsidiary. Who are their other customers?
The more important issue is that they've removed the Activity feed, where you could see likes from other people. It was like a realtime feed of what your friends were doing on the website. The website is way more boring now.
Was there a lot of anxiety? Panic? Or was it just a “woof, that sucks. Time to follow a checklist and then do a bunch of paperwork”?
What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.
Personally, I wasn't part of the actual mitigation of the overall Paris DC recovery this time, as I was busy with an unfortunate[0] side effect of the outage. Those generate more anxiety: being woken up at 6am and told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.
Finally, it's worth repeating that incident management is just one part of the SRE job, and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But their work has prevented the other 99 outages that could have appeared on the front page of Hacker News.
Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down: the regional impact was mitigated as of 06:39 PDT per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having “zone” in the title rather than “region” would reflect reality more closely.
Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.
https://status.cloud.google.com/incidents/8cY8jdUpEGGbsSMSQk...