Back when I worked at a major cloud provider (which admittedly was >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on-call gets paged (with automatic escalation if they don't acknowledge the page within 15 minutes), investigates, and validates that the issue is real (and not trivially correctable).
If the issue is real, a manager gets paged in to coordinate the response, and if the manager considers the outage to be serious (or to affect a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal re any contractual requirements the outage may be breaching...
So in a best-case scenario you'd have 3 minutes (for a fast alarm to fire) plus ~5 minutes for the on-call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions: close to 40 minutes, all before anyone with permission to edit the status page can go ahead and do so.
I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
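For what it's worth, the kind of check I mean could be sketched roughly like this. It is only an illustration: the URL, the function names, the threshold of three consecutive failures, and the banner wording are all made up here, not anything a real provider is known to run.

```python
import http.client
import urllib.request

# Hypothetical endpoint: substitute a cheap, read-only call to your own public API.
PROBE_URL = "https://api.example.com/health"


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return False


def status_banner(recent_results: list[bool], threshold: int = 3) -> str:
    """Show a deliberately vague warning once the last `threshold` probes all failed."""
    last = recent_results[-threshold:]
    if len(last) == threshold and not any(last):
        return "We may be experiencing issues and are looking into it."
    return "No known issues."
```

A scheduler would call `http_probe(PROBE_URL)` every minute or so, append the result to a list, and render `status_banner(...)` on the status page. Requiring several consecutive failures is a guess at how to avoid flapping on a single dropped request; it says nothing about scope or root cause, which is the point.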
Engineers are working the problem. They have a pretty good understanding of the impact of the outage. Then an external comms person asks for an engineer to proofread the external outage comms, which triggers rounds of "no, this part is not technically correct" and "I know the internal system scope impact, but not how that maps to the external product names you want to communicate".
Sure, it'd be nice if the message "we are investigating an issue with… uh… some products" would come up faster.