These guys are messing with planes and don't test enough? Is there any explanation under which these people aren't just incompetent?
All materials ultimately succumb when exposed for long enough to a high enough temperature.
What is the temperature range to match here?
There is no way btrfs can be slower than this in any shape or form.
If we are comparing something simpler, like making a copy of the SQLite database periodically, it makes sense for a COW snapshot to be faster than copying the whole database. After reading the btrfs documentation, it seems reasonable to assume that the snapshot latency stays constant, while a full copy slows down as the single-file database grows bigger.
And so it stands to reason that freezing the database during a full copy is worse than freezing it during a btrfs snapshot. A full copy of the snapshot can then be performed, optionally with a lower IO priority for good measure.
It should be obvious that the less data is physically read or written on the hot path, the less impact there is on latency.
For what it's worth, here is a benchmark comparing IO performance on a few Linux filesystems, including some SQLite tests: https://www.phoronix.com/review/linux-615-filesystems
Meanwhile pirated movies are in Blu-ray quality, with all audio and language options you can dream of.
Also, the last time I checked the Linux scheduling quantum was about 10ms, so it's not clear backups are even going to be the longest downtime while the system is healthy.
Why would the scheduler tick frequency even matter for this discussion, even on a single-CPU/core/thread system? For what it's worth, the default scheduler tick period has been 4ms (250 Hz) since 2005. Earlier this year somebody proposed switching back to 1ms.
https://btrfs.readthedocs.io/en/latest/dev/dev-btrfs-design....
https://docs.kernel.org/admin-guide/pm/cpuidle.html
https://docs.redhat.com/en/documentation/red_hat_enterprise_...
https://sqlite.org/wal.html#ckpt
https://www.phoronix.com/news/Linux-2025-Proposal-1000Hz
(code at https://github.com/accretional/collector - forgive the documentation. I'm working on a container-based agent project and also trialling using agents heavily to write the individual features. It's working pretty well but the agents have been very zealous at documenting things lol).
This is my first real project using sqlite and we've hit some similarly cool benchmarks:
* 5-15ms downtime to backup a live sqlite db with a realistic amount of data for a crud db
* Capable of properly queueing hundreds of read/write operations when temporarily unavailable due to a backup
* e2e latency of basically 1ms for CRUD operations, including proto SerDe
* WAL lets us do continuous, streaming, chunked backups!
Previously I'd only worked with Postgres and Spanner. I absolutely love sqlite so far - would still use Spanner for some tasks with an infinite budget but once we get Collector to implement partitions I don't think I would ever use Postgres again.
Did you consider using a filesystem with atomic snapshots? For example sqlite with WAL on BTRFS. As far as I can tell, this should have decent mechanical sympathy.
edit: I didn't really explain myself. This is for zero downtime backups. Snapshot, backup at your own pace, delete the snapshot.
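Something like this, roughly. This is only a sketch: the paths are made up, and it assumes /data is a btrfs subvolume holding the database (WAL file included), with the snapshot created on the same filesystem:

    use std::io::{Error, ErrorKind};
    use std::process::Command;

    // Hypothetical paths: /data is assumed to be a btrfs subvolume containing
    // the SQLite database and its -wal file; /backup is the copy destination.
    const SUBVOLUME: &str = "/data";
    const SNAPSHOT: &str = "/data/.backup-snap";

    fn run(cmd: &str, args: &[&str]) -> std::io::Result<()> {
        let status = Command::new(cmd).args(args).status()?;
        if !status.success() {
            return Err(Error::new(ErrorKind::Other, format!("{cmd} failed: {status}")));
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        // 1. Atomic, read-only snapshot: the only step the live database notices.
        run("btrfs", &["subvolume", "snapshot", "-r", SUBVOLUME, SNAPSHOT])?;

        // 2. Copy the frozen snapshot at your own pace (cp/rsync/upload),
        //    optionally under a lower IO priority.
        run("cp", &["-a", SNAPSHOT, "/backup"])?;

        // 3. Drop the snapshot once the copy is safely elsewhere.
        run("btrfs", &["subvolume", "delete", SNAPSHOT])
    }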
I used to be an SRE at Google. Back then we also had big outages caused by bad data files pushed to prod. It's a common enough issue so I really sympathize with Cloudflare, it's not nice to be on call for issues like that. But Google's prod environments always generated stack traces for every kind of failure, including CHECK failures (panics) in C++. You could also reflect the stack traces of every thread via HTTP. I used to diagnose bugs in production under time pressure quite regularly using just these tools. You always need detailed diagnostics.
Languages shouldn't have panics, tbh; it's a primitive concept, and it so rarely makes sense to handle errors that way. I know there's a whole body of Rust/Go lore claiming panics are fine, but it's not a good move. It's one of the reasons I've stayed away from Go over the years, and why I wouldn't use Rust for anything higher-level than low-level embedded components or operating system code that has to export a C ABI. You always want diagnostics and recoverable errors; this kind of micro-optimization doesn't make sense outside of the extremely constrained embedded environments that very few of us work in.
https://doc.rust-lang.org/std/panic/index.html
An uncaught exception in C++ or an uncaught panic in Rust terminates the program. The unwinding is the same mechanism. I think the implementation is what comes with LLVM, but I haven't checked.
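For reference, a minimal sketch of catching a panic at a boundary with std::panic::catch_unwind (the handler here is made up):

    use std::panic;

    // Hypothetical request handler that may panic somewhere deep inside.
    fn handle_request(input: &str) -> usize {
        input.parse::<usize>().unwrap() // panics on bad input
    }

    fn main() {
        // catch_unwind turns an unwinding panic into a Result at this boundary,
        // so one bad request doesn't take the whole process down.
        match panic::catch_unwind(|| handle_request("not a number")) {
            Ok(n) => println!("parsed {n}"),
            Err(_) => eprintln!("handler panicked; logged, carrying on"),
        }
        // Caveat: this only works with panic = "unwind" (the default), and it's
        // not meant as a general error-handling mechanism.
    }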
I was also a Google SRE, and I liked the stacktrace facilities so much that I got permission to open source a library inspired from it: https://github.com/bombela/backward-cpp (I know I am not doing a great job maintaining it)
At Uber I implemented similar stack trace introspection for RPC tasks via HTTP for Go services.
You can also catch a Go panic, which is what we did in our RPC library at Uber.
It would be great for all of that to somehow come ready made though. A sort of flag "this program is a service, turn on all the good diagnostics, here is my main loop".
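As far as I know nothing ships ready made, but a rough sketch of the panic half of such a "service mode" would be a global hook that always logs a backtrace (everything below is hypothetical wiring, not an existing crate):

    use std::backtrace::Backtrace;
    use std::panic;

    // Hypothetical "service mode" setup: a global panic hook that always logs
    // the panic message, location and a full backtrace.
    fn install_service_diagnostics() {
        panic::set_hook(Box::new(|info| {
            // force_capture ignores RUST_BACKTRACE, so this is always on.
            let bt = Backtrace::force_capture();
            eprintln!("panic: {info}\nbacktrace:\n{bt}");
        }));
    }

    fn main() {
        install_service_diagnostics();
        // ... the actual service loop would go here ...
        let cfg: Result<u32, &str> = Err("bad feature file");
        cfg.unwrap(); // now logged with a backtrace, not just a bare message
    }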
Similarly, capturing a stack trace in an error type (within a Result, for example) is perfectly possible. But this is a choice left to the programmer, because capturing a trace is not cheap.
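A minimal sketch of that trade-off using only the standard library (the error type and function are made up):

    use std::backtrace::Backtrace;
    use std::num::ParseIntError;

    // Hypothetical error type that pays for a backtrace only when constructed.
    #[derive(Debug)]
    struct ConfigError {
        source: ParseIntError,
        backtrace: Backtrace,
    }

    fn parse_limit(raw: &str) -> Result<u32, ConfigError> {
        raw.parse().map_err(|source| ConfigError {
            source,
            // capture() honours RUST_BACKTRACE / RUST_LIB_BACKTRACE, so the
            // cost is only paid when someone has actually asked for traces.
            backtrace: Backtrace::capture(),
        })
    }

    fn main() {
        if let Err(e) = parse_limit("two hundred") {
            eprintln!("bad limit: {:?}\n{}", e.source, e.backtrace);
        }
    }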
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping Option is potentially dangerous really warms my heart. I can't wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and that even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install life boats.
To me the situation has barely changed. The industry has been refusing to put strong reliability practices in place for decades, keeps significantly under-investing in tools that mitigate errors outside of the few fields where safety was already taken seriously before software was a thing, and keeps hiding behind the excuse that we need to move fast and that safety is too complex and costly, while regulation remains extremely lenient.
I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
But yes, I wish I had learned more, had somehow stumbled upon all the good stuff, or been taught at university at least about what Rust achieves today.
I think it has to be noted that Rust still allows high performance along with the safety it provides. So that's something, maybe.
The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening. A proxy stack written in a managed language with exceptions would have given an error message like this:
com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours. In the past I've been able to diagnose production problems from stack traces so many times that I'd been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...
tldr: Capturing a backtrace can be quite an expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or selectively enabling it in some programs.
By default it is disabled in release mode.
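Concretely, as I understand it, the difference shows up between the two capture functions in std::backtrace:

    use std::backtrace::Backtrace;

    fn main() {
        // Respects RUST_BACKTRACE / RUST_LIB_BACKTRACE: with neither set,
        // this stays disabled and is essentially free.
        let cheap = Backtrace::capture();

        // Always walks the stack, regardless of the environment.
        let forced = Backtrace::force_capture();

        println!("capture():       {:?}", cheap.status());
        println!("force_capture(): {:?}", forced.status());
    }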