While that sounds sensible, it seems the on-call was able to escalate to the team with very little delay.
As far as I can tell from the timeline, it only took 11 minutes from the moment the on-call first attempted the action until the ops team began responding.
Given that this issue was caused by someone unintentionally using a level of access they already had to do something they did not intend, and given that broader access would have reduced the impact only minimally, deciding not to grant higher levels of access to the on-call seems to me to be the right decision.
Cloudflare has nice services that they make available to a lot of people for free.
At the same time, this reminds me that the cloud is someone else's computer, so I'm seeking input and ideas on how to set up a failover with other services or something.
Does anyone know of a setup or design that can shim in a bit of redundancy with something like this? Using one cloud does kind of tie you to them a bit more.
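For what it's worth, one rough way to shim in a bit of redundancy is to run a small health check from somewhere outside either provider and use it to decide which origin a low-TTL DNS record should point at. Here's a minimal sketch of that idea -- the hostnames, health endpoints, and timeout are made up, and the actual DNS update is omitted because that part is provider-specific:

```python
"""Minimal failover sketch: health-check a primary origin (e.g. fronted by
Cloudflare) and a secondary origin on another provider, and report which
one traffic should go to. All names below are hypothetical."""
import requests

# Hypothetical origins: primary fronted by Cloudflare, secondary elsewhere.
ORIGINS = [
    ("primary", "https://primary.example.com/healthz"),
    ("secondary", "https://secondary.example.net/healthz"),
]

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Treat an origin as healthy if its health endpoint returns 200 quickly."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def pick_origin() -> str:
    """Return the first healthy origin, preferring the primary."""
    for name, url in ORIGINS:
        if is_healthy(url):
            return name
    return "none"  # everything is down; page a human

if __name__ == "__main__":
    print(f"active origin: {pick_origin()}")
    # A real setup would now update a low-TTL DNS record (or a load
    # balancer pool) to point at the chosen origin; that step is
    # provider-specific and left out here.
```

It doesn't remove the dependency on any one provider, but it at least gives you a second origin to swing traffic to if the primary's edge has a bad day.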
Not really -- the circular dependency issue happened 28 minutes after the customer impact started. That issue contributed to the outage being double the length it could have been, but it was not an original cause of the outage.
It's comforting to see this happen to a big tech co!
https://hn.algolia.com/?q=cloudflare+incident