ack_complete commented on Closures as Win32 Window Procedures   nullprogram.com/blog/2025... · Posted by u/ibobev
pjc50 · 21 hours ago
Doesn't x32 only have four registers available in the calling convention, AX-DX?
ack_complete · 20 hours ago
The stdcall calling convention used by APIs and API callbacks on Windows x86 doesn't use registers at all; all parameters are passed on the stack. MSVC does support the thiscall/fastcall/vectorcall conventions, which pass some values in registers, but the system APIs and COM interfaces all use stdcall.

Windows x64 and ARM64 do use register passing, with 4 registers for x64 (rcx/rdx/r8/r9) and 8 registers for ARM64 (x0-x7). Passing an additional parameter on the stack would be cheap compared to the workarounds that everyone has to do now.
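
For concreteness, a minimal sketch (an illustrative declaration, not from the article): a window procedure is declared with CALLBACK, which expands to __stdcall on x86, so in a 32-bit build all four parameters go on the stack, while the same signature compiled for x64/ARM64 picks up the normal register convention.

    #include <windows.h>

    // CALLBACK == __stdcall on x86: all four parameters are passed on the stack.
    // On x64 the same declaration receives them in rcx/rdx/r8/r9.
    LRESULT CALLBACK ExampleWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        // Hypothetical handler; just forwards everything to the default procedure.
        return DefWindowProcW(hwnd, msg, wParam, lParam);
    }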

ack_complete commented on Closures as Win32 Window Procedures   nullprogram.com/blog/2025... · Posted by u/ibobev
RossBencina · a day ago
> This is more work than going through GWLP_USERDATA

Indeed, aside from being a party trick, why build an executable trampoline at runtime when you can store and retrieve the context, or a pointer to the context, with SetWindowLong() / GetWindowLong() [1]?

Slightly related: in my view, Win32 windows are a faithful implementation of the Actor Model. The window proc of a window is mutable; it represents the current behavior and can be changed in response to any received message. While I haven't personally seen this used in Win32 programs, it is a powerful feature, as it allows interaction state machines to be implemented in a very natural way (the same way that Miro Samek promotes in his book).

[1] https://learn.microsoft.com/en-us/windows/win32/api/winuser/...
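
As an illustrative aside, a hedged sketch of that "current behavior" idea: the window procedure can be swapped at runtime with SetWindowLongPtr(GWLP_WNDPROC), for example in response to a message (StateA/StateB are hypothetical names, not from the comment).

    #include <windows.h>

    LRESULT CALLBACK StateB(HWND h, UINT m, WPARAM w, LPARAM l);

    LRESULT CALLBACK StateA(HWND h, UINT m, WPARAM w, LPARAM l)
    {
        if (m == WM_LBUTTONDOWN) {
            // Switch the window's behavior for all subsequent messages.
            SetWindowLongPtrW(h, GWLP_WNDPROC, reinterpret_cast<LONG_PTR>(&StateB));
            return 0;
        }
        return DefWindowProcW(h, m, w, l);
    }

    LRESULT CALLBACK StateB(HWND h, UINT m, WPARAM w, LPARAM l)
    {
        // The "next state" behavior would live here.
        return DefWindowProcW(h, m, w, l);
    }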

ack_complete · a day ago
There's an annoying corner case when using SetWindowLongPtr/GetWindowLongPtr() -- Windows sends WM_GETMINMAXINFO before WM_NCCREATE. This can be worked around with a thread local, but a trampoline inherently handles it. Trampolines are also useful for other Win32 user functions that don't have an easy way to store context data, such as SetWindowsHookEx(). They're also slightly faster, though GetWindowLongPtr() at least seems able to avoid a syscall.
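
For reference, a minimal sketch of the GWLP_USERDATA pattern and the corner case, assuming the context pointer is passed as lpCreateParams to CreateWindowExW (MyContext and OnMessage are hypothetical names): the pointer only becomes available at WM_NCCREATE, so anything sent earlier, such as WM_GETMINMAXINFO, sees a null context and has to fall back to DefWindowProc (or be handled via a thread local).

    #include <windows.h>

    struct MyContext {
        LRESULT OnMessage(HWND h, UINT m, WPARAM w, LPARAM l) {
            return DefWindowProcW(h, m, w, l);  // placeholder behavior
        }
    };

    LRESULT CALLBACK StaticWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        if (msg == WM_NCCREATE) {
            // CreateWindowExW passed the context pointer as lpCreateParams.
            auto* cs = reinterpret_cast<CREATESTRUCTW*>(lParam);
            SetWindowLongPtrW(hwnd, GWLP_USERDATA,
                              reinterpret_cast<LONG_PTR>(cs->lpCreateParams));
        }

        auto* ctx = reinterpret_cast<MyContext*>(GetWindowLongPtrW(hwnd, GWLP_USERDATA));
        if (!ctx) {
            // e.g. WM_GETMINMAXINFO arrives before WM_NCCREATE has stored the pointer.
            return DefWindowProcW(hwnd, msg, wParam, lParam);
        }
        return ctx->OnMessage(hwnd, msg, wParam, lParam);
    }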

The code as written, though, is missing a call to FlushInstructionCache() and might not work in processes that prohibit dynamic code generation. An alternative is to just pregenerate an array of trampolines in a code segment, each referencing a mutable pointer in a parallel array in the data segment. These can be generated straightforwardly with a little template magic. This adds size to the executable, unlike an empty RWX segment, but doesn't run afoul of any dynamic codegen restrictions or require I-cache flushing. The number of trampolines must be predetermined, but the RWX segment has the same limitation.
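
A hedged sketch of that pregenerated-trampoline alternative, assuming a fixed pool size: the forwarding functions are ordinary compiled code (so they live in the code segment), and each one reads its context from a mutable parallel array in the data segment (Slot, g_slots, and MakeTrampolineTable are illustrative names, not from the post).

    #include <windows.h>
    #include <array>
    #include <cstddef>
    #include <utility>

    using BoundProc = LRESULT (CALLBACK *)(void* ctx, HWND, UINT, WPARAM, LPARAM);

    struct Slot { BoundProc fn; void* ctx; };
    static Slot g_slots[16];   // mutable per-trampoline state, in the data segment

    template <std::size_t I>
    LRESULT CALLBACK Trampoline(HWND h, UINT m, WPARAM w, LPARAM l)
    {
        if (!g_slots[I].fn)                     // slot not bound yet
            return DefWindowProcW(h, m, w, l);
        return g_slots[I].fn(g_slots[I].ctx, h, m, w, l);
    }

    template <std::size_t... I>
    constexpr auto MakeTrampolineTable(std::index_sequence<I...>)
    {
        return std::array<WNDPROC, sizeof...(I)>{ &Trampoline<I>... };
    }

    // A fixed pool of 16 statically generated trampolines; hand one out per window.
    constexpr auto g_trampolines = MakeTrampolineTable(std::make_index_sequence<16>{});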

ack_complete commented on The stack circuitry of the Intel 8087 floating point chip, reverse-engineered   righto.com/2025/12/8087-s... · Posted by u/elpocko
rasz · 5 days ago
To the last point, I would see it the other way around. Rearranging code for the Pentium FPU's pipelined 0-cycle FXCH sped up floating point by probably way more than 2x compared to heavily optimized code running on a K5/K6. I'm not even sure if the K6/K6-2 ever got 0-cycle FXCH; the K6-3 did, but there was still no FPU pipelining until the Athlon.

Quake wouldn't have happened until the Pentium II if Intel hadn't pipelined the FPU.

ack_complete · 5 days ago
You're not wrong; the performance gain from proper FPU instruction scheduling on a Pentium was immense. But applications written before Quake and the Pentium gained prominence, or that weren't game-oriented, would have needed more blended code generation. Optimizing for the highest-end CPU at the time at the cost of the lowest-end CPU wouldn't necessarily have been a good idea, unless your lowest CPU was a Pentium. (Which it was for Quake, which was a slideshow on a 486.)

The K6 did have the advantage of being OOO, which greatly reduced the importance of instruction scheduling, and of having good integer performance. It also had some advantage with 3DNow! starting with the K6-2, for the limited software that could use it.

ack_complete commented on The stack circuitry of the Intel 8087 floating point chip, reverse-engineered   righto.com/2025/12/8087-s... · Posted by u/elpocko
CaliforniaKarl · 5 days ago
I wonder, if C used Reverse-Polish notation for math operations, would compilers have been able to target the 8087 better than they did?
ack_complete · 5 days ago
Nah. As others have said, translating infix to RPN is pretty easy to do. The nasty part was keeping values in registers on the stack, especially within loops. The 8087 couldn't do binary ops between two arbitrary locations on the stack; one operand had to be the top of the stack. If you needed to add two non-top locations, for example, you had to exchange (FXCH) one of them to the top of the stack first, so optimized x87 code tended to be a mess of FXCH instructions.
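
A rough illustration of that, as a sketch in MSVC 32-bit x86 inline-asm syntax with hypothetical values (this only builds as x86, since __asm is unavailable for x64): adding two values that are not at the top of the x87 stack first requires bringing one of them to st(0) with FXCH.

    // 32-bit MSVC only; illustrative values, not a real workload.
    double AddNonTopOfStack()
    {
        double a = 1.0, b = 2.0, c = 3.0, sum;
        __asm {
            fld   a            // st0=a
            fld   b            // st0=b  st1=a
            fld   c            // st0=c  st1=b  st2=a
            // a+b can't be added directly: neither operand is at the top
            fxch  st(2)        // st0=a  st1=b  st2=c
            fadd  st(0), st(1) // st0=a+b
            fstp  sum          // store the result and pop
            fstp  st(0)        // discard b
            fstp  st(0)        // discard c
        }
        return sum;
    }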

Complicating this further, doing this in a loop requires that the stack state match between the start and end of the loop. This can be challenging to do with minimal FXCH instructions. I've seen compilers emit 3+ FXCH instructions in a row at the end of a loop to match the stack state, where with some hairy rearrangement it was possible to get it down to 2 or 1.

Finally, the performance characteristics of different x87 implementations varied in annoying ways. The Intel Pentium, for instance, required very heavy use of FXCH to keep the add and multiply pipelines busy. Other x87 FPUs at the time, however, were non-pipelined, some taking 4 cycles for an FADD and another 4 cycles for FXCH. This meant that rearranging x87 code for Pentium could _halve_ the speed on other CPUs.

ack_complete commented on Cool-retro-term: terminal emulator which mimics look and feel of CRTs   github.com/Swordfish90/co... · Posted by u/michalpleban
poke646 · 20 days ago
It's almost like a caricature of a CRT. I can see the novelty, but hope that people aren't led to believe monitors looked like this.

I think what bothers me most is the horizontal line that slowly moves across the screen every few seconds. It's an artifact of recording a CRT on film and doesn't occur when you look at a real monitor...

ack_complete · 20 days ago
It also happens with digital cameras for similar reasons, due to CCD scanning. But yeah, that doesn't happen looking directly at a CRT.

The bloom is also too blobby, because it's a Gaussian blur. I ran into the same issue trying to implement a similar effect. The bloom shape needs to be sharper to look realistic -- which unfortunately also means a non-separable blur.

ack_complete commented on Bypassing the Branch Predictor   nicula.xyz/2025/03/10/byp... · Posted by u/signa11
tehjoker · a month ago
If the branch is only taken once, how can you realize a significant performance benefit of more than a few ns?
ack_complete · a month ago
A cold branch comes to mind -- something like an interrupt handler that runs often enough to matter, but not in bursts high enough to stay hot in the branch predictor.
ack_complete commented on LLM policy?   github.com/opencontainers... · Posted by u/dropbox_miner
CGamesPlay · a month ago
I make a lot of drive-by contributions, and I use AI coding tools. I submitted my first PR that is a cross between those two recently. It's somewhere between "vibe-coded" and "vibe-engineered", where I definitely read the resulting code, had the agent make multiple revisions, and deployed the result on my own infrastructure before submitting a PR. In the PR I clearly stated that it was done by a coding agent.

I can't imagine that any policy against LLM code would allow this sort of thing, but I also imagine that if I don't say "this was made by a coding agent", that no one would ever know. So, should I just stop contributing, or start lying?

[append] Getting a lot of hate for this, which I guess is a pretty clear answer. I guess the reason the "fuck off" isn't registering clearly with me is that when I see these threads of people complaining about AI content, it's really clearly low-quality crap that (for example) doesn't even compile and wastes everyone's time.

I feel different from those cases because I did spend my time to resolve the issue for myself, did review the code, did test it, and do stand by what I'm putting under my name. Hmm.

ack_complete · a month ago
> I can't imagine that any policy against LLM code would allow this sort of thing, but I also imagine that if I don't say "this was made by a coding agent", that no one would ever know. So, should I just stop contributing, or start lying?

If a project has a stated policy that code written with an LLM-based aid is not accepted, then it shouldn't be submitted, same as with anything else that might be prohibited. If you attempt to circumvent this by hiding it and it is revealed that you knowingly did so in violation of the policy, then it would be unsurprising for you to receive a harsh reply and/or ban, as well as a revert if the PR was committed. This would be the same as any other prohibition, such as submitting code copied from another project with an incompatible license.

You could argue that such a blanket ban is unwarranted, and you might be right. But the project maintainers have a right to set the submission rules for their project, even if it rules out high-quality LLM-assisted submissions. The right way to deal with this is to ask the project maintainers if they would be willing to adjust the policy, not to try to slip such code into the project anyway.

ack_complete commented on Windows 10 Deadline Boosts Mac Sales   macrumors.com/2025/10/25/... · Posted by u/akyuu
keyringlight · 2 months ago
The interesting thing about Kaby Lake/7th gen is that other CPUs in that generation are allowed by MS. While there is the extension-support aspect, after Spectre/Meltdown I think part of it was getting AMD/Intel to sign up to providing firmware updates for product ranges over the Win11 lifespan; then the cycle will repeat again for Win12.
ack_complete · 2 months ago
This was a late addition to the Windows 11 supported CPU list. The rumor is that this happened after it was pointed out that Microsoft was still selling brand new Surface Studio 2 devices that had 7th gen Intel CPUs.
ack_complete commented on When you opened a screen shot of a video in Paint, the video was playing in it   devblogs.microsoft.com/ol... · Posted by u/birdculture
ahartmetz · 2 months ago
Not necessary for blending in video overlays, and wasteful. Well, necessary inside the overlay if that is where the controls should appear. Alpha blending is two reads and one write per pixel for the whole affected region (whatever that is; it could be the whole screen). An opaque overlay is one read and one write, only for the pixels in the desired rectangle.
ack_complete · 2 months ago
The video overlays in question are not drawn by blending into a framebuffer in memory. They're two separate display planes that are read in parallel by the display path, scaled, and blended together at scan-out time. There are only reads, no writes. Modern GPUs support alpha-blended display planes using an alpha channel that often has to exist anyway as padding.

As OP noted, using hardware display planes can have efficiency advantages for cases like floating controls over a video or smoothly animating a plane over a static background, since it avoids an extra read+write for the composited image. However, it also has some quirks -- like hardware bandwidth limits on how small a display plane can be scaled.
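
Purely as a conceptual sketch (not any particular hardware's implementation), the scan-out blend amounts to a per-pixel "over" operation between the planes, with no framebuffer writes anywhere:

    struct Pixel { float r, g, b, a; };

    // What the display path effectively computes per pixel at scan-out when an
    // alpha-blended plane sits over the primary plane. Straight (non-premultiplied)
    // alpha is assumed here for simplicity; the result feeds the display directly.
    Pixel ScanOutBlend(Pixel primary, Pixel overlay)
    {
        float a = overlay.a;
        return { overlay.r * a + primary.r * (1.0f - a),
                 overlay.g * a + primary.g * (1.0f - a),
                 overlay.b * a + primary.b * (1.0f - a),
                 1.0f };
    }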

ack_complete commented on When you opened a screen shot of a video in Paint, the video was playing in it   devblogs.microsoft.com/ol... · Posted by u/birdculture
edgineer · 2 months ago
"Nowadays, video rendering is no longer done with overlays."

Darn, I thought this explained why, after upgrading my GPU, videos playing in Chrome have a thin green stripe on their right edge.

ack_complete · 2 months ago
A green stripe on the right/bottom is usually due to a different issue: interpolation errors in chroma planes when decoding YCbCr video. The chroma planes use a biased encoding where 0 (no color) is mapped to 128. A sloppy YCbCr to RGB conversion without proper clamping can interpolate against 0 at edges, which is interpreted as maximum negative chroma red/blue; combined, those produce green. This can happen either due to an incorrectly padded texture or due to failing to handle the special final-sample case for 4:2:2 chroma.
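
A rough numeric illustration, using approximate BT.601 full-range coefficients (an assumption; the exact matrix depends on the video): if the chroma samples end up near 0 instead of the 128 "no color" bias, the converted pixel clamps to pure green.

    #include <algorithm>
    #include <cstdio>

    static int Clamp8(float v) { return (int)std::min(255.0f, std::max(0.0f, v)); }

    int main()
    {
        float Y = 128.0f, Cb = 0.0f, Cr = 0.0f;   // chroma should have been 128

        int r = Clamp8(Y + 1.402f * (Cr - 128.0f));
        int g = Clamp8(Y - 0.344f * (Cb - 128.0f) - 0.714f * (Cr - 128.0f));
        int b = Clamp8(Y + 1.772f * (Cb - 128.0f));

        std::printf("R=%d G=%d B=%d\n", r, g, b);  // roughly R=0 G=255 B=0: green
        return 0;
    }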

This issue can happen with overlays, but also non-overlay GPU drawing or CPU conversion routines.
