`cgroup.memory=nokmem` avoids this.
I believe the kernel's cgroup writeback accounting features are enabled / disabled based on this code: https://github.com/torvalds/linux/blob/c291c9cfd76a8fb92ef3d...
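If anyone wants to try it, it's a kernel boot parameter, so something along these lines (the GRUB file and `update-grub` step assume a Debian/Ubuntu-style setup; adjust for your distro):

```
# Append the parameter to the kernel command line (Debian/Ubuntu-style GRUB).
# In /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... cgroup.memory=nokmem"
sudo update-grub
sudo reboot
cat /proc/cmdline   # verify the parameter took effect after reboot
```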
This is really not great in situations where you're bootstrapping a fresh server. Here's what happens:
- you boot up a server with, say, 1TB of RAM
- your default dirty ratio is 10 (best case)
- you quickly write 90GB of files to your server (images, whatever)
- you get mad unblocked throughput as the page cache fills up & the kernel hasn't even tried flushing anything to disk yet
- your application starts, takes up 9GB of memory
- it starts serving requests and writes another 1GB of memory-mapped cache
- the kernel starts to flush, realises the disk is slower than it thought, and starts to trip over itself, aggressively throttling IO until it can catch up
- your app is now IO bound while the kernel thrashes around for a bit
This can be tuned by adjusting the vm.dirty_* defaults, and is well worth doing IMO. The defaults that kernels still ship with are from a long time ago when we didn't have this much memory available.
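For example, something like this caps dirty memory in absolute bytes rather than as a percentage of (huge) RAM - the numbers are purely illustrative, not recommendations from this thread:

```
# Cap dirty page cache in absolute bytes instead of a % of RAM.
# Values are illustrative - size them to what your disks can actually sustain.
sudo sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at ~256MB dirty
sudo sysctl -w vm.dirty_bytes=1073741824             # throttle writers at ~1GB dirty
# Note: setting the *_bytes knobs zeroes the corresponding *_ratio knobs.
# Persist across reboots:
printf 'vm.dirty_background_bytes = 268435456\nvm.dirty_bytes = 1073741824\n' | \
  sudo tee /etc/sysctl.d/99-writeback.conf
```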
My memory of this next bit is flaky at best, so happy to be corrected here, but I remember this also being a big problem with k8s. With cgroups v1, a node would get added to your cluster and a pod would get scheduled there. The pod would be limited to, say, 4GB of memory - way more than it actually used - but it would do a lot of IO. Because the node still had a ton of free memory and was way below its default dirty writeback ratio/bytes, none of that IO would get flushed to disk for ages, but the dirty pages sitting in the page cache would still be counted towards the pod's memory usage even though they weren't 'real' memory - they were something completely out of the control of the pod (or Kubernetes, really).

Before you knew it, bOOM: pod oomkilled for seemingly no reason, and no way to do anything about it. I remember some issues where people skirted around it by looking off into the middle distance and saying the usual things about k8s not being for stateful workloads, but it was really lame and really not talked about enough.
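If you want to see that charging happen on a v1 node, a rough way to eyeball it is something like this (the cgroup path and pod id are made up for illustration; find the real ones under /sys/fs/cgroup/memory/kubepods/):

```
# cgroup v1 charges dirty page cache to the pod's memory cgroup.
# Path and pod id below are placeholders, not from the original discussion.
POD_CG=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX
grep -E '^(total_)?(cache|dirty|writeback) ' "$POD_CG/memory.stat"
cat "$POD_CG/memory.limit_in_bytes"   # the limit those dirty pages count against
```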
This might seem unrelated, but you guessed it, it was fixed in cgroups v2, and I imagine that the fix for that problem either directly or indirectly explains why OP saw a difference in behaviour between cgroups v1 and v2.
Also, slightly related: I remember discovering a while back that for workloads like this, where you've got a high turnover of files & processes, having the `discard` (TRIM) flag set on your SSD mount could really mess you up (definitely on ext4, not sure about xfs). It would prevent the page cache from evicting pages of deleted files without forcing writeback first, which is obviously the opposite of what it was designed to do (protect/trim the SSDs). Not to mention cause all sorts of zombifications when terminated processes still had memory-mapped files that hadn't been flushed to disk, etc.
AFAIK it's still a problem, though it's been years since I profiled this stuff. At peak load with io-intensive workloads, you could end up with SSDs making your app run slower. Try remounting without the `discard` flag (and periodically fstrim manually), or use `discard=async`, and see what difference it makes.
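If you want to experiment, this is the usual shape of it (the mount point is a placeholder; check your own fstab and device names first):

```
# Mount point is a placeholder. Drop inline discard and trim on a schedule instead.
sudo mount -o remount,nodiscard /data
sudo systemctl enable --now fstrim.timer   # util-linux timer, weekly by default
sudo fstrim -v /data                       # or run it by hand to compare behaviour
```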
The problem described in my post was not _directly_ related to the kernel flushing dirty pages to disk. As such, I'm not sure that tweaking these sysctls would have made any difference.
Instead, we were seeing the kernel use too much CPU when it moved inodes from one cgroup to another. This is part of the kernel's cgroup writeback accounting logic. I believe it's a related but slightly different form of writeback problem :)
The realtime API hasn't been implemented in the Opower integration yet. That said, I don't think it would be too hard to implement. See: https://github.com/tronikos/opower/issues/24
This realtime data is also available and graphed on your account page on the Con Ed website and mobile app.
I wrote my own code that uses Con Ed's realtime API and writes the data to Prometheus so that I can view it in Grafana. My code was heavily influenced by Home Assistant's Opower integration code. Here's my code: https://github.com/dasl-/pitools/blob/main/sensors/measure_e...
Email and ask
I submitted this message, feel free to copy the same text and submit yourself also:
-----------------------------
I recently became aware that the Living Computers Museum, which was created by Paul Allen (Microsoft co-founder), is shutting down. As someone in the technology industry, I find that very sad! The museum was really magical. I'm wondering if the Gates Foundation can step up and save the museum from closing?
https://www.geekwire.com/2024/seattles-living-computers-muse...
Thank you for your consideration.
It works for me on Linux; not sure about other OSes. Although I'm now noticing that the article linked in the original post says that Ruby has a pure-Ruby replacement for readline: Reline. So I wonder if it won't work with more recent versions of Ruby that use Reline?
I remember watching Farbrausch's "fr-08 .the .produkt" [0] when it came out and telling myself: "If a computer can do this with 64KB of data, at this speed, my programs should be able to do the same, or at least come close." I was forever poisoned from that point on, and that simple sentence shaped my whole academic life and career.
[0]: https://www.pouet.net/prod.php?which=1221
P.S.: Rewatching it, again, for the nth time. Hats off to chaos, fiver2, kb, doj, ryg & yoda.
P.P.S.: I show people the YouTube version of Elevated (https://www.pouet.net/prod.php?which=52938) and ask them to guess the size of the binary that renders this thing in real time. The answer blows everyone's mind, every time.
If you're ever in the mood to revisit that problem, you should try disabling that discard flag and see if it makes a difference. Also, if it were me, I'd have tried setting LimitNOFILE to whatever it is in my shell and seeing if the rsync still behaved differently.
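Something like this would do it with systemd - the unit name is just a placeholder for whatever service was running the rsync:

```
# Unit name is a placeholder - substitute the service that runs the rsync.
ulimit -n                          # what your interactive shell gets
sudo systemctl edit rsync-job.service
#   [Service]
#   LimitNOFILE=1048576            # e.g. match the shell's value
sudo systemctl restart rsync-job.service
systemctl show rsync-job.service -p LimitNOFILE   # confirm it took
```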
Anyway - thoroughly enjoyed your article. You should write more :)