eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
keypusher · 3 months ago
The most surprising thing to me here is that it took 3 hours to root cause, and points to a glaring hole in the platform observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.
eastdakota · 3 months ago
That’s not accurate. As with any incident response, there were a number of theories about the cause that we worked in parallel. The feature file failure was identified as a potential cause in the first 30 minutes. However, the theory that seemed most plausible, based on what we were seeing (intermittent failures, initially concentrated in the UK, a spike in errors for certain API endpoints) as well as what else we’d been dealing with (a botnet that had escalated DDoS attacks from 3 Tbps to 30 Tbps against us and others like Microsoft over the last 3 months), was that we were under attack. After an hour we ruled out the DDoS theory. We had other theories still running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt.

One thing that made us initially question that theory was that nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious. Even after we identified the problem with the feature file, we did not have an automated process to roll the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time, and waking people up, as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double-check we wouldn’t make things worse.

The propagation then takes some time, especially because there are tiers of caching of the file that we had to clear. Finally, we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines. So I think the best description is that it took us an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.
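
To make the "known-safe previous version" idea above concrete, here is a minimal, hypothetical sketch (not Cloudflare's actual code; the file names, the line-per-feature format, and the MAX_FEATURES limit are all invented) of a loader that validates a newly propagated feature file and falls back to the last known-good copy instead of crashing the process:

    use std::fs;
    use std::path::Path;

    // Hypothetical limit; the real system's bounds and file format are not public here.
    const MAX_FEATURES: usize = 200;

    struct FeatureFile {
        features: Vec<String>,
    }

    // Parse and validate a candidate feature file. An oversized or empty file is a
    // validation error to handle, not something to unwrap() and panic on.
    fn load_and_validate(path: &Path) -> Result<FeatureFile, String> {
        let raw = fs::read_to_string(path)
            .map_err(|e| format!("read {}: {}", path.display(), e))?;
        let features: Vec<String> = raw
            .lines()
            .map(|l| l.trim().to_string())
            .filter(|l| !l.is_empty())
            .collect();
        if features.is_empty() {
            return Err("feature file is empty".into());
        }
        if features.len() > MAX_FEATURES {
            return Err(format!(
                "feature count {} exceeds limit {}",
                features.len(),
                MAX_FEATURES
            ));
        }
        Ok(FeatureFile { features })
    }

    // Prefer the freshly propagated file, but fall back to the last known-good copy
    // automatically if validation fails, instead of crashing the process.
    fn load_with_fallback(candidate: &Path, last_known_good: &Path) -> Result<FeatureFile, String> {
        match load_and_validate(candidate) {
            Ok(file) => {
                // Keep this version around as the new known-good copy for next time.
                let _ = fs::copy(candidate, last_known_good);
                Ok(file)
            }
            Err(err) => {
                eprintln!("candidate feature file rejected ({err}); falling back");
                load_and_validate(last_known_good)
            }
        }
    }

    fn main() {
        match load_with_fallback(Path::new("features.candidate"), Path::new("features.known_good")) {
            Ok(file) => println!("loaded {} features", file.features.len()),
            Err(err) => eprintln!("no usable feature file: {err}"),
        }
    }

The point of the sketch is only that rejection plus fallback can be automated, which is what the comment above says was missing at the time.
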
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
philipwhiuk · 3 months ago
Except it fails to document anything about the changes they made to Warp in London during the resolution.
eastdakota · 3 months ago
There are lots of things we did while we were trying to track down and debug the root cause that didn’t make it into the post. Sorry the WARP takedown impacted you. As I said in a comment above, it was the result of us (wrongly) believing that this was an attack targeting WARP endpoints in our UK data centers. That turned out to be wrong, but based on where errors initially spiked, it was a reasonable hypothesis we wanted to rule out.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
ynx0 · 3 months ago
How do you guys handle redaction? I'm sure that even when trusted individuals are in charge of authoring, there's still a potential for accidental leakage, which would probably be best mitigated by a team specifically looking for any slip-ups.

Thanks for the insight.

eastdakota · 3 months ago
The team has a good sense, typically. In this case, the names of the columns in the Bot Management feature table seemed sensitive. The person who included that in the master document we were working from added a comment: “Should redact column names.” John and I usually catch anything the rest of the team may have missed. For me, it pays to have gone to law school, but it also pays to have studied Computer Science in college and to be technical enough to still understand both the SQL and Rust code here.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
SerCe · 3 months ago
As always, kudos for releasing a post mortem less than 24 hours after the outage; very few tech organisations are capable of doing this.
eastdakota · 3 months ago
* published less than 12 hours from when the incident began. Proud of the team for pulling together everything so quickly and clearly.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
chrismorgan · 3 months ago
> much better than their completely false “checking the security of your connection” message

The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):

> example.com needs to review the security of your connection before proceeding.

It bothers me how this bald-faced lie of a wording has persisted.

(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)

eastdakota · 3 months ago
Next time, open the dev console in your browser and look at how much is going on in the background.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
philipwhiuk · 3 months ago
Why was Warp in London disabled temporarily? That change wasn't discussed in the RCA, despite being called out in an update.

For London customers, this temporarily made the impact more severe.

eastdakota · 3 months ago
We incorrectly thought at the time that it was attack traffic coming in via WARP into LHR. In reality, it was just that the failures started showing up there first because of how the bad file propagated and where in the world it was working hours.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
philipgross · 3 months ago
You call this transparency, but fail to answer the most important questions: what was in the burrito? Was it good? Would you recommend?
eastdakota · 3 months ago
Chicken burrito from Coyo Taco in Lisbon. I am not proud of this. It’s worse than ordering from Chipotle. But there are no Chipotles in Lisbon… yet.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
yen223 · 3 months ago
I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.

At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.

eastdakota · 3 months ago
Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like a waste, but days like today prove useful.

I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member, whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too.

The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places where we weren’t clear. At some point John ordered sushi, but from a place with limited delivery options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write, we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back.

A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy. We were embarrassed by what had happened. But we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once-over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes, so I did. That was the process.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
tclancy · 3 months ago
I feel like your username really brings something extra to the party. Now go home.
eastdakota · 3 months ago
Can attest: not a single LLM used. Couldn’t if I tried. Old school. And not entirely proud of that.
eastdakota commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
tptacek · 3 months ago
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
eastdakota · 3 months ago
That’s correct.
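
For readers who haven't seen the pattern tptacek describes, here is a minimal, hypothetical sketch of a long-running process actuated by a published config file (the file name, poll interval, and reload behavior are invented; this is not Cloudflare's implementation):

    use std::fs;
    use std::path::Path;
    use std::thread;
    use std::time::{Duration, SystemTime};

    // Poll a published config file and reload it whenever its mtime changes.
    // This is the generic "control channel via published file" pattern the thread
    // describes, not Cloudflare's actual mechanism.
    fn watch_and_reload(path: &Path, interval: Duration) {
        let mut last_modified: Option<SystemTime> = None;
        loop {
            if let Ok(meta) = fs::metadata(path) {
                if let Ok(modified) = meta.modified() {
                    if last_modified != Some(modified) {
                        match fs::read_to_string(path) {
                            Ok(contents) => {
                                // A real system would parse this and atomically swap the
                                // new config into the running detection engine.
                                println!("reloaded config ({} bytes)", contents.len());
                                last_modified = Some(modified);
                            }
                            Err(err) => eprintln!("config read failed: {err}"),
                        }
                    }
                }
            }
            thread::sleep(interval);
        }
    }

    fn main() {
        // Hypothetical file name and poll interval.
        watch_and_reload(Path::new("bot_features.conf"), Duration::from_secs(5));
    }

Tools like consul-template follow essentially this shape, rendering a file that a consumer process picks up and reloads, which is the resemblance tptacek notes.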

u/eastdakota

Karma: 11157 · Cake day: December 6, 2010
About
A little bit geek, wonk, and nerd. Repeat entrepreneur, recovering lawyer and former ski instructor. CEO & co-founder of CloudFlare. [ my public key: https://keybase.io/eastdakota; my proof: https://keybase.io/eastdakota/sigs/_uDY0ZsLTEWaNu5daRtuwzZtJDJJrtGS4uXoYxwI634 ]