JB_Dev commented on Cloudflare was down   cloudflare.com/... · Posted by u/mektrik
lima · 24 days ago
They don't appear to have a rollout procedure for some of their globally replicated application state. They had a number of major outages over the past years which all had the same root cause of "a global config change exposed a bug in our code and everything blew up".

I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, it would appear that the code has to be written much more defensively than what they're doing right now.

JB_Dev · 24 days ago
Yeah, agreed. This is the same discussion point that came up last time they had an incident.

I really don’t buy this requirement to always deploy state changes 100% globally immediately. Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That would go a long way towards reducing the impact of these regressions.

If they really think something is so critical that it has to go everywhere immediately, then sure, set the rollout to start at 100%.

Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
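
To make that concrete, here is a minimal sketch of a ramped rollout with a health gate between stages. Everything in it is hypothetical: `apply_to_fraction`, `is_healthy`, `rollback`, the stage percentages and the soak time are illustrative stand-ins, not anything Cloudflare has described.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    apply_to_fraction: Callable[[float], None],  # hypothetical: push change to this share of the fleet
    is_healthy: Callable[[], bool],              # hypothetical: automated health check
    rollback: Callable[[], None],                # hypothetical: revert to last known good state
    stages: Sequence[float] = (0.01, 0.05, 0.25, 0.5, 1.0),
    soak_seconds: float = 0.0,                   # pause between stages; 0 keeps the sketch fast
) -> bool:
    """Ramp a change through the stages; roll back on the first failed health check."""
    for fraction in stages:
        apply_to_fraction(fraction)
        time.sleep(soak_seconds)  # give the health signal time to show a regression
        if not is_healthy():
            rollback()
            return False
    return True
```

An emergency change that has to go everywhere at once would just pass `stages=(1.0,)`, so the fast path stays available without being the default.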

JB_Dev commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
tptacek · a month ago
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
JB_Dev · a month ago
Code and config should be treated similarly. If you would use a ring-based rollout, canaries, etc. for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.
JB_Dev commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
abalone · a month ago
I’ve led multiple incident responses at a FAANG, here’s my take. The fundamental problem here is not Rust or the coding error. The problem is:

1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.

2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.

SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.

While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.

Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.

JB_Dev · a month ago
Does their ring-based rollout really, truly have to be 0->100% in a few seconds?

I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.

As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
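
A configurable ramp can be as simple as one function of elapsed time. This is a sketch under my own assumptions: the linear shape, the 1% starting point and the name `rollout_fraction` are all made up for illustration.

```python
def rollout_fraction(elapsed_s: float, ramp_s: float, start: float = 0.01) -> float:
    """Share of the fleet that should have the change after elapsed_s seconds.

    ramp_s is the configurable ramp time: 3600 for a routine change,
    a few seconds (or 0) when responding to an attack.
    """
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return 1.0  # ramp finished, or no ramp requested: everywhere at once
    if elapsed_s <= 0:
        return start
    # Linear interpolation from `start` up to 100% over the ramp window.
    return start + (1.0 - start) * (elapsed_s / ramp_s)
```

With `ramp_s=3600` a routine change reaches about half the fleet after 30 minutes, while `ramp_s=0` gives the immediate global push.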

JB_Dev commented on Abusing Entra OAuth for fun and access to internal Microsoft applications   research.eye.security/con... · Posted by u/the1bernard
medhir · 5 months ago
ohhhh the gifts multi-tenant app authorization keeps giving!

(laid off) Microsoft PM here that worked on the patch described as a result of the research from Wiz.

One correction I’d like to suggest to the article: the guidance given is to check either the “iss” or “tid” claim when authorizing multi-tenant apps.

The actual recommended guidance we provided is slightly more involved. There is a chance that when only validating the tenant, any service principal could be granted authorized access.

You should always validate the subject in addition to validating the tenant for the token being authorized. One method for this would be to validate the token using a combined key (for example, tid+oid) or perform checks on both the tenant and subject before authorizing access. More info can be found here:

https://learn.microsoft.com/en-us/entra/identity-platform/cl...
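
A rough sketch of the combined-key check described above. It assumes the token's signature, issuer, audience and expiry have already been verified by a proper library, and that `claims` is the decoded payload; `is_authorized` and the allowlist are hypothetical names, not part of any Microsoft SDK.

```python
from typing import Mapping, Set, Tuple


def is_authorized(claims: Mapping[str, str], allowed: Set[Tuple[str, str]]) -> bool:
    """Authorize only known (tenant, subject) pairs, never the tenant alone.

    Assumes `claims` comes from a token that was already fully validated.
    """
    tid = claims.get("tid")  # tenant ID claim
    oid = claims.get("oid")  # object ID of the calling user or service principal
    if not tid or not oid:
        return False  # a missing claim is a rejection, not a wildcard
    return (tid, oid) in allowed
```

Keying the allowlist on the pair means another service principal in an allowed tenant is still rejected.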

JB_Dev · 5 months ago
You are 100% correct but really these engineers should go read the guidance - it’s pretty clear what is required: https://learn.microsoft.com/en-us/entra/identity-platform/cl...
JB_Dev commented on The Myth of Developer Obsolescence   alonso.network/the-recurr... · Posted by u/cat-whisperer
yoyohello13 · 7 months ago
I'm convinced the Microsoft Teams team has gone all in on vibe coding. I have never seen so many broken features released in such a short time frame as the last couple months. This is the future as more companies go all in on AI coding.
JB_Dev · 7 months ago
Nah, this is just Microsoft's quality bar in general. AI will only accelerate the decline.
JB_Dev commented on Watching AI drive Microsoft employees insane   old.reddit.com/r/Experien... · Posted by u/laiysb
robotcapital · 7 months ago
Replace the AI agent with any other new technology and this is an example of a company:

1. Working out in the open

2. Dogfooding their own product

3. Pushing the state of the art

Given that the negative impact here falls mostly (completely?) on the Microsoft team which opted into this, is there any reason why we shouldn't be supporting progress here?

JB_Dev · 7 months ago
100% agree. I’m not sure why everyone is clowning on them here. This process is a win. Do people want this all hidden in a forked private repo instead?

It’s showing the actual capabilities in practice. That’s much better and way more illuminating than what normally happens with sales and marketing hype.

JB_Dev commented on My new deadline: 20 years to give away virtually all my wealth   gatesnotes.com/home/home-... · Posted by u/nrvn
JB_Dev · 8 months ago
I actually have the opposite position on this. First-world countries already have the funds and economy to pursue exactly what you describe. They just lack the political will. I don’t care to subsidise that intentional lack of investment.

I would much rather give to charities focusing on countries that don’t have the economy/ability to fix their basic issues.

JB_Dev commented on Pi-hole v6   pi-hole.net/blog/2025/02/... · Posted by u/tkuraku
progbits · 10 months ago
I recommend putting all these things on their own VLANs with strict routing rules.

For example my STB is on a VLAN that has WAN access (otherwise it won't do anything), but that makes it untrustworthy so it is completely isolated from rest of LAN.

On the other hand some "smart"/IoT devices are on a VLAN that has no WAN access so that they can't phone home, become a botnet, or download firmware updates that remove functionality in favor of subscription services. Only a VM running homeassistant can talk to them.

This will work until amazon sidewalk / built-in LTE modems become too frequent, at that point I'll have to start ripping out the radio modules from things I buy.

JB_Dev · 10 months ago
Call me pessimistic, but as the sidewalk pattern becomes more common for IoT, I wouldn’t be surprised if a “malfunctioning radio” just results in the device not working properly.
JB_Dev commented on Making an intersection unsafe for pedestrians to save seconds for drivers   collegetowns.substack.com... · Posted by u/raybb
potato3732842 · a year ago
As a pedestrian I will take a busy light-controlled intersection with a pedestrian scramble type walk signal over a busy 4-way stop every single time.

With the 4-way stop there is never a time in the cycle when all traffic is stopped. The drivers who are present are continuously paying attention to what other drivers are doing which robs them of situational awareness to note pedestrians. You can try and time it but that's risky. With the walk signal there is a brief moment in time when the drivers are doing nothing but waiting for you and are all stopped so you as a pedestrian can account for them in preparation just before you get your signal and make your move.

The author can get lost with this sort of textbook correct but questionable in reality take. Legally having the right of way doesn't make you any less dead when the driver who's got three other drivers to pay attention to doesn't see you.

JB_Dev · a year ago
Make it a roundabout with protected pedestrian crossings. That forces drivers to be looking at the conflict point with pedestrians as they manoeuvre the roundabout.
JB_Dev commented on Making an intersection unsafe for pedestrians to save seconds for drivers   collegetowns.substack.com... · Posted by u/raybb
DHPersonal · a year ago
I remember my father telling me that was how it was supposed to be done, as the yellow light for oncoming traffic would convince them to stop and give you the time to complete the left turn. It only worked when they weren't also running the yellow light! These days I prefer waiting to turn so that I'm not stuck out in the middle of the intersection when the traffic light changes.
JB_Dev · a year ago
If you do not wait in the intersection itself then you would never get a chance to turn in many intersections. The only solution is to always wait in the intersection itself.

u/JB_Dev

Karma: 119 · Cake day: July 17, 2022