These guys are messing with planes and don't test enough? Is there any explanation under which these people aren't just incompetent?
All materials ultimately succumb when exposed for long enough to a high enough temperature.
What is the temperature range to match here?
There is no way btrfs can be slower than this in any shape or form.
If we are comparing something simpler, like making a copy of the SQLite database periodically, it makes sense for a COW snapshot to be faster than copying the whole database. After reading the btrfs documentation, it seems reasonable to assume that the snapshot latency stays constant, while a full copy slows down as the single-file database grows bigger.
And so it stands to reason that freezing the database during a full copy is worse than freezing it during a btrfs snapshot. A full copy of the snapshot can then be performed, optionally with a lower IO priority for good measure.
It should be obvious that the less data is physically read or written on the hot path, the less impact there is on latency.
For what it's worth, here is a benchmark comparing IO performance on a few Linux filesystems, including some SQLite tests: https://www.phoronix.com/review/linux-615-filesystems
Meanwhile pirated movies are in Blu-ray quality, with all audio and language options you can dream of.
Also, the last time I checked the Linux scheduling quantum was about 10ms, so it's not clear backups are even going to be the longest downtime while the system is healthy.
Why would the scheduler tick frequency even matter for this discussion, even on a single-CPU/core/thread system? For what it's worth, the default scheduler tick period has been 4ms (250 Hz) since 2005. Earlier this year somebody proposed switching back to 1ms.
https://btrfs.readthedocs.io/en/latest/dev/dev-btrfs-design....
https://docs.kernel.org/admin-guide/pm/cpuidle.html
https://docs.redhat.com/en/documentation/red_hat_enterprise_...
https://sqlite.org/wal.html#ckpt
https://www.phoronix.com/news/Linux-2025-Proposal-1000Hz
(code at https://github.com/accretional/collector - forgive the documentation. I'm working on a container-based agent project and also trialling using agents heavily to write the individual features. It's working pretty well but the agents have been very zealous at documenting things lol).
This is my first real project using sqlite and we've hit some similarly cool benchmarks:
* 5-15ms downtime to backup a live sqlite db with a realistic amount of data for a crud db
* Capable of properly queueing hundreds of read/write operations when temporarily unavailable due to a backup
* e2e latency of basically 1ms for CRUD operations, including proto SerDe
* WAL lets us do continuous, streaming, chunked backups!
Previously I'd only worked with Postgres and Spanner. I absolutely love sqlite so far - would still use Spanner for some tasks with an infinite budget but once we get Collector to implement partitions I don't think I would ever use Postgres again.
Did you consider using a filesystem with atomic snapshots? For example sqlite with WAL on BTRFS. As far as I can tell, this should have decent mechanical sympathy.
edit: I didn't really explain myself. This is for zero downtime backups. Snapshot, backup at your own pace, delete the snapshot.
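Something like this, roughly. This is only a sketch: the paths are made up, and it assumes /data is a btrfs subvolume holding the database (WAL file included), with the snapshot created on the same filesystem:

    use std::io::{Error, ErrorKind};
    use std::process::Command;

    // Hypothetical paths: /data is assumed to be a btrfs subvolume containing
    // the SQLite database and its -wal file; /backup is the copy destination.
    const SUBVOLUME: &str = "/data";
    const SNAPSHOT: &str = "/data/.backup-snap";

    fn run(cmd: &str, args: &[&str]) -> std::io::Result<()> {
        let status = Command::new(cmd).args(args).status()?;
        if !status.success() {
            return Err(Error::new(ErrorKind::Other, format!("{cmd} failed: {status}")));
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        // 1. Atomic, read-only snapshot: the only step the live database notices.
        run("btrfs", &["subvolume", "snapshot", "-r", SUBVOLUME, SNAPSHOT])?;

        // 2. Copy the frozen snapshot at your own pace (cp/rsync/upload),
        //    optionally under a lower IO priority.
        run("cp", &["-a", SNAPSHOT, "/backup"])?;

        // 3. Drop the snapshot once the copy is safely elsewhere.
        run("btrfs", &["subvolume", "delete", SNAPSHOT])
    }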
I used to be an SRE at Google. Back then we also had big outages caused by bad data files pushed to prod. It's a common enough issue so I really sympathize with Cloudflare, it's not nice to be on call for issues like that. But Google's prod environments always generated stack traces for every kind of failure, including CHECK failures (panics) in C++. You could also reflect the stack traces of every thread via HTTP. I used to diagnose bugs in production under time pressure quite regularly using just these tools. You always need detailed diagnostics.
Languages shouldn't have panics, tbh; it's a primitive concept, and it so rarely makes sense to handle errors that way. I know there's a whole body of Rust/Go lore claiming panics are fine, but it's not a good move. It's one of the reasons I've stayed away from Go over the years, and why I wouldn't use Rust for anything higher-level than low-level embedded components or operating system code that has to export a C ABI. You always want diagnostics and recoverable errors; this kind of micro-optimization doesn't make sense outside of the extremely constrained embedded environments that very few of us work in.
https://doc.rust-lang.org/std/panic/index.html
An uncaught exception in C++ or an uncaught panic in Rust terminates the program. The unwinding is the same mechanism. I think the implementation is what comes with LLVM, but I haven't checked.
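For reference, a minimal sketch of catching a panic at a boundary with std::panic::catch_unwind (the handler here is made up):

    use std::panic;

    // Hypothetical request handler that may panic somewhere deep inside.
    fn handle_request(input: &str) -> usize {
        input.parse::<usize>().unwrap() // panics on bad input
    }

    fn main() {
        // catch_unwind turns an unwinding panic into a Result at this boundary,
        // so one bad request doesn't take the whole process down.
        match panic::catch_unwind(|| handle_request("not a number")) {
            Ok(n) => println!("parsed {n}"),
            Err(_) => eprintln!("handler panicked; logged, carrying on"),
        }
        // Caveat: this only works with panic = "unwind" (the default), and it's
        // not meant as a general error-handling mechanism.
    }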
I was also a Google SRE, and I liked the stacktrace facilities so much that I got permission to open source a library inspired from it: https://github.com/bombela/backward-cpp (I know I am not doing a great job maintaining it)
At Uber I implemented similar stack trace introspection for RPC tasks via HTTP for Go services.
You can also catch a Go panic, which is what we did in our RPC library at Uber.
It would be great for all of that to somehow come ready made though. A sort of flag "this program is a service, turn on all the good diagnostics, here is my main loop".
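As far as I know nothing ships ready made, but a rough sketch of the panic half of such a "service mode" would be a global hook that always logs a backtrace (everything below is hypothetical wiring, not an existing crate):

    use std::backtrace::Backtrace;
    use std::panic;

    // Hypothetical "service mode" setup: a global panic hook that always logs
    // the panic message, location and a full backtrace.
    fn install_service_diagnostics() {
        panic::set_hook(Box::new(|info| {
            // force_capture ignores RUST_BACKTRACE, so this is always on.
            let bt = Backtrace::force_capture();
            eprintln!("panic: {info}\nbacktrace:\n{bt}");
        }));
    }

    fn main() {
        install_service_diagnostics();
        // ... the actual service loop would go here ...
        let cfg: Result<u32, &str> = Err("bad feature file");
        cfg.unwrap(); // now logged with a backtrace, not just a bare message
    }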
Similarly, capturing a stack trace in an error type (within a Result, for example) is perfectly possible. But this is a choice left to the programmer, because capturing a trace is not cheap.
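A minimal sketch of that trade-off using only the standard library (the error type and function are made up):

    use std::backtrace::Backtrace;
    use std::num::ParseIntError;

    // Hypothetical error type that pays for a backtrace only when constructed.
    #[derive(Debug)]
    struct ConfigError {
        source: ParseIntError,
        backtrace: Backtrace,
    }

    fn parse_limit(raw: &str) -> Result<u32, ConfigError> {
        raw.parse().map_err(|source| ConfigError {
            source,
            // capture() honours RUST_BACKTRACE / RUST_LIB_BACKTRACE, so the
            // cost is only paid when someone has actually asked for traces.
            backtrace: Backtrace::capture(),
        })
    }

    fn main() {
        if let Err(e) = parse_limit("two hundred") {
            eprintln!("bad limit: {:?}\n{}", e.source, e.backtrace);
        }
    }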
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping Option is potentially dangerous really warms my heart. I can't wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and that even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install life boats.
To me the situation has barely changed. The industry has been refusing to put strong reliability practices in place for decades, keeps significantly under-investing in tools that mitigate errors outside of the few fields where safety was already taken seriously before software was a thing, and keeps hiding behind the excuse that we need to move fast and that safety is too complex and costly, while regulation remains extremely lenient.
I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
But yes, I wish I had learned more, had somehow stumbled upon all the good stuff, or been taught at university at least about what Rust achieves today.
I think it has to be noted that Rust still allows high performance along with the safety it provides. So that's something, maybe.
The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening. A proxy stack written in a managed language with exceptions would have given an error message like this:
com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours. In the past I've been able to diagnose production problems from stack traces so many times that I'd been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...
tldr: Capturing a backtrace can be quite an expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or selectively enabling it in some programs.
By default it is disabled in release mode.
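Concretely, as I understand it, the difference shows up between the two capture functions in std::backtrace:

    use std::backtrace::Backtrace;

    fn main() {
        // Respects RUST_BACKTRACE / RUST_LIB_BACKTRACE: with neither set,
        // this stays disabled and is essentially free.
        let cheap = Backtrace::capture();

        // Always walks the stack, regardless of the environment.
        let forced = Backtrace::force_capture();

        println!("capture():       {:?}", cheap.status());
        println!("force_capture(): {:?}", forced.status());
    }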