While that sounds sensible, it seems the on-call was able to escalate to the team with very little delay.
As far as I can tell from the timeline, it only took 11 minutes from the moment the on-call first attempted the action until the ops team began responding.
Given that this issue was caused by someone unintentionally using a level of access they already had to do something they did not intend, and given that broader access would have reduced the impact only minimally, deciding not to grant higher levels of access to the on-call seems to me to be the right decision.
Cloudflare has nice services that they make available to a lot of people for free.
At the same time, this reminds me that the cloud is someone else's computer, so I'm seeking input and ideas on how to set up a failover with other services or something.
Does anyone know of a setup or design that can shim in a bit of redundancy with something like this? Using one cloud does kind of tie you to them a bit more.
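For what it's worth, one rough way to shim in a bit of redundancy is to run a small health check from somewhere outside either provider and use it to decide which origin a low-TTL DNS record should point at. Here's a minimal sketch of that idea -- the hostnames, health endpoints, and timeout are made up, and the actual DNS update is omitted because that part is provider-specific:

```python
"""Minimal failover sketch: health-check a primary origin (e.g. fronted by
Cloudflare) and a secondary origin on another provider, and report which
one traffic should go to. All names below are hypothetical."""
import requests

# Hypothetical origins: primary fronted by Cloudflare, secondary elsewhere.
ORIGINS = [
    ("primary", "https://primary.example.com/healthz"),
    ("secondary", "https://secondary.example.net/healthz"),
]

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Treat an origin as healthy if its health endpoint returns 200 quickly."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def pick_origin() -> str:
    """Return the first healthy origin, preferring the primary."""
    for name, url in ORIGINS:
        if is_healthy(url):
            return name
    return "none"  # everything is down; page a human

if __name__ == "__main__":
    print(f"active origin: {pick_origin()}")
    # A real setup would now update a low-TTL DNS record (or a load
    # balancer pool) to point at the chosen origin; that step is
    # provider-specific and left out here.
```

It doesn't remove the dependency on any one provider, but it at least gives you a second origin to swing traffic to if the primary's edge has a bad day.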
Not really -- the circular dependency issue happened 28 minutes after the customer impact started. That issue contributed to the outage being double the length it could have been, but it was not an original cause of the outage.
It's comforting to see this happen to a big tech co!
https://hn.algolia.com/?q=cloudflare+incident