Readit News
hansvm commented on Branch prediction: Why CPUs can't wait?   namvdo.ai/cpu-branch-pred... · Posted by u/signa11
zenolijo · 5 days ago
I do wonder how branch prediction actually works in the CPU; predicting which branch to take also seems like it should be expensive, but I guess something clever is going on.

I've also found G_LIKELY and G_UNLIKELY in glib to be useful when writing some types of performance-critical code. It would be a fun experiment to compare the assembly with and without them.
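
Something like this would be a starting point for that experiment (the function is made up for illustration; on GCC/Clang, G_LIKELY/G_UNLIKELY expand to __builtin_expect under the hood):

  #include <glib.h>  /* provides G_LIKELY / G_UNLIKELY */

  /* Hypothetical hot loop, just to have something to compile;
     the hint says the negative case is expected to be rare. */
  int sum_nonnegative(const int *xs, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
      if (G_UNLIKELY(xs[i] < 0))
        continue;
      total += xs[i];
    }
    return total;
  }

Diffing the `gcc -O2 -S` output with and without the macro shows the effect: with the hint, the compiler tends to move the rare path out of line so the common path falls straight through.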

hansvm · 5 days ago
Semantically it's just a table from instruction location to branch probability. Some nuances exist in:

- Table overflow mitigation: multi-leveled tables, not wasting space on 100% predicted branches, etc

- Table eviction: True rolling counts are impossible without extra space; do you accept some wasted space, periodic flushing, exponential moving averages, etc

- Table initialization: When do you start caring about a branch (and wasting table space), how conservative are the initial parameters, etc

- Table overflow: What do you do when a branch doesn't fit in the table but should

As a rule of thumb, no extra information/context is used for branch prediction. If a program, over the course of a few thousand instructions, hits a branch X% of the time, then X% will be the prediction for that branch. If you have context you want to use to influence the prediction, you need to manifest that context as additional lines of assembly the predictor can use in its lookup table.
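
For concreteness, a rough C sketch of that last sentence (the names and the macro are made up for illustration): give each context its own copy of the branch, so each copy gets its own instruction address and therefore its own slot in the predictor's table.

  /* One shared loop means one branch address. If some call sites almost
     never match and others almost always match, that single entry sees a
     muddy mix and predicts poorly for everyone. Stamping out per-context
     copies gives each branch its own address and its own history. */
  #define DEFINE_COUNT_OVER(name)                            \
    static long name(const int *xs, int n, int threshold) {  \
      long hits = 0;                                         \
      for (int i = 0; i < n; i++)                            \
        if (xs[i] > threshold)                               \
          hits++;                                            \
      return hits;                                           \
    }

  DEFINE_COUNT_OVER(count_over_rare)    /* used where matches are rare     */
  DEFINE_COUNT_OVER(count_over_common)  /* used where matches are the norm */

In practice the compiler often does this for you via inlining or specialization; the point is just that the predictor only sees addresses, so "context" has to become distinct code.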

As another rule of thumb, if the hot path has more than a few thousand branches, you'll hit slow paths -- multi-leveled search, mispredicted branches, etc. On modern architectures the budget is often just a few thousand <100%-predicted branches (and you want the assembly to generate the jump-if-not-equal in the right direction for that architecture, else you'll get a 100% misprediction rate instead).

It's reasonably interesting, and given that it's hardware it's definitely clever, but it's not _that_ clever from a software perspective. Is there anything in particular you're curious about?

hansvm commented on Left to Right Programming   graic.net/p/left-to-right... · Posted by u/graic
baalimago · 6 days ago
You don't need to be bothered with minute details such as the syntax for counting string length anymore; you just need to know what you want to do. I mention this since OP is bringing up LSPs as an argument for why certain languages' designs are suboptimal.

"Count length of string s" -> LLM -> correct syntax for string-count for any programming language. This is the perfect context-length for an LLM. But note that you don't "complete the line", you tell the LLM what you want to have done in full (very isolated) context, instead of having it guessing.

hansvm · 5 days ago
I don't think that meaningfully engages with the author's point.

If you already know how to compute `len` on some arbitrary syntax soup, then the difference is just a minor annoyance: you have to jump back in your editor, add a function call and some punctuation, and jump back to where you were to add some closing punctuation. It's so fast you'd never bother with an LLM, so despite real and meaningful differences existing, the LLM discussion point isn't relevant.

If you don't know how to compute `len` on some arbitrary syntax soup, I don't see how crafting an ideal prompt in a "full (very isolated) context" is ever faster than tab-completing things which look like "count" or "len."

hansvm commented on OpenBSD is so fast, I had to modify the program slightly to measure itself   flak.tedunangst.com/post/... · Posted by u/Bogdanp
mananaysiempre · 9 days ago
>> does not account for frequency scaling on laptops

> Are you sure about that?

> [...] RDTSC instruction is not a cycle counter, it’s a high resolution wallclock timer [...]

So we are in agreement here: with RDTSC you’re not counting cycles, you’re counting seconds. (That’s what I meant by “does not account for frequency scaling”.) I guess there are legitimate reasons to do that, but I’ve found organizing an experimental setup for wall-clock measurements to be excruciatingly difficult: getting 10–20% differences depending on whether your window is open or AC is on, or on how long the rebuild of the benchmark executable took, is not a good time. In a microbenchmark, I’d argue that makes RDTSC the wrong tool even if it’s technically usable with enough work. In other situations, it might be the only tool you have, and then sure, go ahead and use it.

> The time spent in syscalls was the main objective the OP was measuring.

I mean, of course I’m not covering TFA’s use case when I’m only speaking about Linux and Windows, but if you do want to include time in syscalls on Linux that’s also only a flag away. (With a caveat for shared resources—you’re still not counting time in kswapd or interrupt handlers, of course.)

hansvm · 9 days ago
Cycles are often not what you're trying to measure with something like this. You care about whether the program has higher latency, higher inverse throughput, and other metrics denominated in wall-clock time.

Cycles are a fine thing to measure when trying to reason about pieces of an algorithm and estimate its cost (e.g., latency and throughput tables for assembly instructions are invaluable). They're also a fine thing to measure when frequency scaling is independent of the instructions being executed (since then you can perfectly predict which algorithm will be faster independent of the measurement noise).

That's not the world we live in though. Instructions cause frequency scaling -- some relatively directly (like a cost for switching into heavy avx512 paths on some architectures), some indirectly but predictably (physical limits on moving heat off the chip without cryo units), some indirectly but unpredictably (moving heat out of a laptop casing as you move between having it on your lap and somewhere else). If you just measure instruction counts, you ignore effects like the "faster" algorithm always throttling your CPU 2x because it's too hot.

One of the better use cases for something like RDTSC is when microbenchmarking a subcomponent of a larger algorithm. You take as your prior that no global state is going to affect performance (e.g., not overflowing the branch prediction cache), and then the purpose of the measurement is to compute the delta of your change in situ, measuring _only_ the bits that matter to increase the signal to noise.
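
As a rough sketch of that pattern (everything here is invented for illustration; `__rdtsc` is the GCC/Clang intrinsic on x86):

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>  /* __rdtsc on x86 with GCC/Clang */

  /* Stand-in for the subcomponent whose change you care about. */
  static uint64_t step(uint64_t x) { return x * 2654435761u + 12345; }

  int main(void) {
    enum { N = 1000000 };
    uint64_t acc = 0;
    for (uint64_t i = 0; i < N; i++) {
      uint64_t t0 = __rdtsc();
      acc += step(i);               /* only the bit that matters */
      uint64_t dt = __rdtsc() - t0;
      (void)dt;                     /* a real harness records dt into a
                                       histogram and compares distributions */
    }
    printf("%llu\n", (unsigned long long)acc);  /* keep the work alive */
    return 0;
  }

(A real harness would also think about serializing around the timer reads, but the shape is the point: only the subcomponent sits inside the measured region.)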

In that world, I've never had the variance you describe be a problem. Computers are fast. Just bang a few billion things through your algorithm and compare the distributions. One might be faster on average. One might have better tail latency. Who knows which you'll prefer, but at least you know you actually measured the right thing.

For that matter, even a stddev of 80% isn't that bad. At $WORK we frequently benchmark the whole application even for changes which could be microbenchmarked. Why? It's easier. Variance doesn't matter if you just run the test longer.

You have a legitimate point in some cases. E.g., maybe a CLI tool does a heavy amount of work for O(1 second). Thermal throttling will never happen in the real world, but a sustained test would have throttling (and also different branch predictions and whatnot), so counting cycles is a reasonable proxy for the thing you actually care about.

I dunno; it's complicated.

hansvm commented on Weathering Software Winter (2022)   100r.co/site/weathering_s... · Posted by u/todsacerdoti
strken · 13 days ago
I have only very brief experience from once owning half of a 15 foot fibreglass runabout fixer-upper, but if you've got a 30 foot yacht then can't you just stick it on a trailer yourself? I feel like you're imagining a much bigger craft.
hansvm · 12 days ago
On top of that, dry docks are a common free amenity in boating/fishing towns, just using the tide to do your dirty work.
hansvm commented on What does it mean to be thirsty?   quantamagazine.org/what-d... · Posted by u/pseudolus
avalys · 13 days ago
I'm so interested in this topic, for a weird reason.

Since I was a kid, I've thought I was "prone to migraines", and ascribed various triggers to them - sun exposure, heat, physical exertion, mental exertion, etc. I'd get a migraine sometimes after a long hike on a weekend - and also a long business meeting entirely indoors in an air-conditioned space.

Only when I was around 35, did I figure something out. All these situations lead to me getting dehydrated without any obvious accompanying feeling of thirst. Hiking all day will do it - walking around an outdoor shopping mall on a hot afternoon - or sitting in an all-day business meeting focused on the work at hand and forgetting to drink. And all these situations lead to a migraine - my only "migraine" trigger is simple dehydration, nothing more complicated.

The weird thing is, it took me a long time (decades) to put this together, because I just figured that I couldn't be dehydrated if I wasn't thirsty, and I had no association between "feeling thirsty" and getting a migraine.

I get what I consider normally thirsty in other circumstances, but somehow there's a failure mode where my body doesn't warn me. So now I just remember to chug lots of water (and electrolytes) if I'm exerting myself even if I don't really feel thirsty, and I can systematically avoid triggering migraines.

Now that I understand it the association is quite clear and obvious in retrospect.

hansvm · 13 days ago
In activities prone to heatstroke, the advice is similar. Even for people with normal thirst detection, by the time you realize you're thirsty it's likely too late. You need to be proactive about drinking enough water.
hansvm commented on Byte Buddy is a code generation and manipulation library for Java   bytebuddy.net/... · Posted by u/mooreds
selimco · 13 days ago
It seems like Micronaut has been able to avoid runtime bytecode generation by doing everything at compile time. I wonder if there are things that you can't do the Micronaut way.
hansvm · 13 days ago
Sure:

- There are how many computer architectures? A compile-once-run-anywhere binary looks closer to shipping a fancy interpreter with your code than shipping a compiled project. Runtime bytecode generation is one technique for making that fast.

- More generally, anything you don't know till runtime generates a huge amount of bloat if you handle it at compile-time. Imagine, e.g., a UI for dragging and dropping ML components to create an architecture. For as much compute as you're about to pour into training, even for very simple problems, it's worth something that looks like a compilation pass to appropriately fuse everything together. You could probably get away with literally shipping a compiler, but bytecode generation is a reasonable solution too.

- Some things are literally impossible at compile-time without boxing and other overhead. E.g., once upon a time I made a zero-cost-abstraction library allowing you to specify an ML computational graph using the type system (most useful for problems where you're not just doing giant matmuls all day). It was in a language where mutually recursive generics are lazily generated, so you're able to express arbitrary nth derivatives still in the type system, still with zero overhead. What you can't do though is create a runtime program capable of creating arbitrary derivatives; there must be an upper bound for any finite-sized binary (for sufficiently complex starting functions) -- you could cap it at 2nd derivatives or 10th or whatever, but there would have to be a cap. If you move that to runtime though then you can have your cake and eat it too, less the cost of compiling (i.e., bytecode generation) at runtime.

Etc. It's a tradeoff between binary size (which might have to be infinite in the compiled case) and runtime overhead (having to "compile" for each new kind of input you find).

hansvm commented on Faster substring search with SIMD in Zig   aarol.dev/posts/zig-simd-... · Posted by u/todsacerdoti
unwind · 14 days ago
Can the compiler detect that and use the proper code so no test is needed at runtime?

This is Zig so I guess the answer is "yeah, duh" but wanted to ask since it sounded like the solution is less, uh, "compiler-friendly" than I would expect.

hansvm · 13 days ago
Yes, and if you're paranoid you can write

  if (comptime T == u8) {
    // code
  }
to guarantee that if you're wrong about how the compiler behaves then you'll get a compiler error.

hansvm commented on Faster substring search with SIMD in Zig   aarol.dev/posts/zig-simd-... · Posted by u/todsacerdoti
aarol · 14 days ago
I'm the author of the post, thanks for your feedback! I was inspired by your comment on HN a while back and started learning about this stuff, reading the source code of `memchr` was especially great.

You're totally right about the first part. There was a serious consideration to add this to Zig's standard library, and there would definitely need to be a fallback to avoid the `O(m*n)` situation.

I'll admit that there are a lot of false assumptions at the end; you could totally specialize it for u8 and also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`.

hansvm · 13 days ago
Or at runtime, if you'd like. You can create a generic binary that runs faster on supported platforms.
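
The thread is about Zig, but the runtime side is easiest to sketch in C (names invented; the builtins are GCC/Clang's): detect CPU features once, then dispatch through a pointer so one binary takes the faster path only where it's supported.

  #include <stddef.h>

  /* Baseline implementation; works everywhere. */
  static const char *find_scalar(const char *hay, size_t n, char needle) {
    for (size_t i = 0; i < n; i++)
      if (hay[i] == needle)
        return hay + i;
    return NULL;
  }

  /* The wide version would go here; it just falls back in this sketch. */
  static const char *find_avx2(const char *hay, size_t n, char needle) {
    return find_scalar(hay, n, needle);
  }

  static const char *(*find_impl)(const char *, size_t, char) = find_scalar;

  static void init_dispatch(void) {
    __builtin_cpu_init();                  /* GCC/Clang x86 builtins */
    if (__builtin_cpu_supports("avx2"))
      find_impl = find_avx2;
  }

glibc does essentially this for its own memchr/memcpy via ifunc resolvers, so the pattern is well trodden.
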
hansvm commented on Food, housing, & health care costs are a source of major stress for many people   apnorc.org/projects/food-... · Posted by u/speckx
kasey_junk · 16 days ago
Sure, but you can run the numbers for basically every grocery in America. The industry considers 3% margins to be outstanding.
hansvm · 14 days ago
That reads more as an indictment of hyperscaled grocers than as an argument that the current price gouging is unavoidable.
hansvm commented on Why insurers worry the world could soon become uninsurable   cnbc.com/2025/08/08/clima... · Posted by u/mooreds
kibwen · 14 days ago
> Second, governments and private companies should be looking at (socializing) mitigations that will keep risk within tolerable levels

The moral hazard is killer here. I fear that in practice what this means is that the rest of the US will end up bailing out the gormless Floridians who refuse to stop building McMansions on the coast. Insert the gif of Bugs Bunny cutting off Florida and letting it drift off into the Caribbean.

hansvm · 14 days ago
Easy. Don't apply the policy to new homes.
