Readit News
Const-me commented on OpenBSD is so fast, I had to modify the program slightly to measure itself   flak.tedunangst.com/post/... · Posted by u/Bogdanp
mananaysiempre · 9 days ago
This does not account for frequency scaling on laptops, context switches, core migrations, time spent in syscalls (if you don’t want to count it), etc. On Linux, you can get the kernel to expose the real (non-“reference”) cycle counter for you to access with __rdpmc() (no syscall needed) and put the corrective offset in a memory-mapped page. See the example code under cap_user_rdpmc in the manpage for perf_event_open() [1] and NOTE WELL the -1 in rdpmc(idx-1) there (I definitely did not waste an hour on that).
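
A minimal sketch of that approach, loosely adapted from the manpage example (error handling, the seqlock protocol around pc->lock, and sign extension via pc->pmc_width are all omitted, so treat it as illustration only):

    #include <linux/perf_event.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <x86intrin.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = 1;    // drop this if you do want to count cycles spent in syscalls

        const int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        // The kernel exposes the counter index and corrective offset in this page.
        auto* pc = (perf_event_mmap_page*)mmap(nullptr, sysconf(_SC_PAGESIZE),
                                               PROT_READ, MAP_SHARED, fd, 0);
        if (pc == MAP_FAILED || !pc->cap_user_rdpmc || pc->index == 0) return 1;

        // NOTE the -1: pc->index is the RDPMC index plus one.
        const uint64_t begin = __rdpmc(pc->index - 1) + pc->offset;
        /* ... code under test ... */
        const uint64_t end = __rdpmc(pc->index - 1) + pc->offset;
        printf("cycles: %llu\n", (unsigned long long)(end - begin));
        return 0;
    }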

If you want that on Windows, well, it’s possible, but you’re going to have to do it asynchronously from a different thread and also compute the offsets your own damn self[2].

Alternatively, on AMD processors only, starting with Zen 2, you can get the real cycle count with __aperf() or __rdpru(__RDPRU_APERF) or manual inline assembly depending on your compiler. (The official AMD docs will admonish you not to assign meaning to anything but the fraction APERF / MPERF in one place, but the conjunction of what they tell you in other places implies that MPERF must be the reference cycle count and APERF must be the real cycle count.) This is definitely less of a hassle, but in my experience the cap_user_rdpmc method on Linux is much less noisy.
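
If your compiler lacks an intrinsic for it, a hand-rolled version is tiny (sketch only; the instruction is emitted as raw bytes so no special compiler flags are needed):

    #include <cstdint>

    // Read APERF on Zen 2+ via RDPRU (opcode 0F 01 FD); ECX selects the register,
    // 0 = MPERF, 1 = APERF. The result comes back in EDX:EAX.
    static inline uint64_t readAperf()
    {
        uint32_t lo, hi;
        __asm__ volatile(".byte 0x0f, 0x01, 0xfd"
                         : "=a"(lo), "=d"(hi)
                         : "c"(1u));
        return ((uint64_t)hi << 32) | lo;
    }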

[1] https://man7.org/linux/man-pages/man2/perf_event_open.2.html

[2] https://www.computerenhance.com/p/halloween-spooktacular-day...

Const-me · 9 days ago
> does not account for frequency scaling on laptops

Are you sure about that?

> time spent in syscalls (if you don’t want to count it)

The time spent in syscalls was the main thing the OP was measuring.

> cycle counter

While technically interesting, most of the time when I do micro-benchmarks I only care about wallclock time. Contrary to what you see in search engines and ChatGPT, the RDTSC instruction is not a cycle counter, it’s a high-resolution wallclock timer. That instruction was counting CPU cycles some 20 years ago; it doesn’t do that anymore.

Const-me commented on OpenBSD is so fast, I had to modify the program slightly to measure itself   flak.tedunangst.com/post/... · Posted by u/Bogdanp
Const-me · 9 days ago
Not sure if that’s relevant, but when I do micro-benchmarks like that, measuring time intervals way smaller than 1 second, I use the __rdtsc() compiler intrinsic instead of standard library functions.

On all modern processors, that instruction measures wallclock time with a counter which increments at the base frequency of the CPU, unaffected by dynamic frequency scaling.

Apart from the great resolution, that time measuring method has the upside of being very cheap, a couple of orders of magnitude faster than an OS kernel call.
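
A minimal sketch of the idea; the one-time calibration against std::chrono is just one way to convert ticks to nanoseconds, you could also query the base frequency directly:

    #include <x86intrin.h>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // One-time calibration: measure how many TSC ticks elapse per nanosecond.
    static double ticksPerNanosecond()
    {
        using namespace std::chrono;
        const uint64_t t0 = __rdtsc();
        const auto c0 = steady_clock::now();
        while (steady_clock::now() - c0 < milliseconds(10)) {}   // ~10 ms is plenty
        const uint64_t t1 = __rdtsc();
        const auto c1 = steady_clock::now();
        return double(t1 - t0) / double(duration_cast<nanoseconds>(c1 - c0).count());
    }

    int main()
    {
        const double tpn = ticksPerNanosecond();
        const uint64_t begin = __rdtsc();
        /* ... code under test ... */
        const uint64_t end = __rdtsc();
        printf("%.1f ns\n", double(end - begin) / tpn);
        return 0;
    }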

Const-me commented on Perfecting anti-aliasing on signed distance functions   blog.pkh.me/p/44-perfecti... · Posted by u/ibobev
Const-me · 20 days ago
Good article, but I believe it lacks information about what specifically these magical dFdx, dFdy, and fwidth = abs(dFdx) + abs(dFdy) functions are computing.

The following stackexchange answer addresses that question rather well: https://gamedev.stackexchange.com/a/130933/3355 As you can see, dFdx and dFdy are not exactly derivatives; they are discrete screen-space approximations of those derivatives, very cheap to compute due to the weird execution model of pixel shaders running on GPU hardware.
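
To make that concrete, here’s a CPU-side sketch in plain C++ (not shader code) of the finite differences the hardware effectively computes within a 2×2 pixel quad; the circle SDF is just a made-up example function:

    #include <cmath>
    #include <cstdio>

    // Hypothetical per-pixel scalar: signed distance to a circle of radius 50 centered at (100, 100).
    static float f(float x, float y) { return std::hypot(x - 100.0f, y - 100.0f) - 50.0f; }

    int main()
    {
        const float x = 120.0f, y = 80.0f;
        // The GPU shades pixels in 2x2 quads, so these differences between
        // neighbouring pixels of the same quad are nearly free to compute.
        const float dFdx = f(x + 1, y) - f(x, y);
        const float dFdy = f(x, y + 1) - f(x, y);
        const float fwidth = std::abs(dFdx) + std::abs(dFdy);
        printf("dFdx = %g, dFdy = %g, fwidth = %g\n", dFdx, dFdy, fwidth);
        return 0;
    }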

Const-me commented on How to trigger a command on Linux when power switches from AC to battery   dataswamp.org/~solene/202... · Posted by u/Mr_Minderbinder
Const-me · 24 days ago
Interestingly, a few months ago I wanted a similar thing for Windows. Ended up developing a simple tray utility for that. Probably the most important method is the handler of the WM_POWERBROADCAST message: https://github.com/Const-me/SleepOnUnplug/blob/0.4/SleepOnUn...
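
The gist of it looks roughly like this (a simplified sketch, not the actual code from that repository; SetSuspendState needs PowrProf.lib):

    #include <windows.h>
    #include <powrprof.h>
    #pragma comment(lib, "PowrProf.lib")

    LRESULT CALLBACK wndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        if (msg == WM_POWERBROADCAST && wParam == PBT_APMPOWERSTATUSCHANGE)
        {
            SYSTEM_POWER_STATUS sps = {};
            if (GetSystemPowerStatus(&sps) && sps.ACLineStatus == 0)
            {
                // AC power is gone: we're now on battery, so put the machine to sleep.
                SetSuspendState(FALSE, FALSE, FALSE);
            }
            return TRUE;
        }
        return DefWindowProc(hWnd, msg, wParam, lParam);
    }
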
Const-me commented on .NET 10 Preview 6 brings JIT improvements, one-shot tool execution   infoworld.com/article/402... · Posted by u/breve
pjmlp · 25 days ago
In my little enterprise bubble, the only place remaining for C and C++ is writing native libraries to be consumed by managed languages, or bindings.

The last time I wrote pure C++ applications at work, was in 2005.

Libraries still regularly.

Const-me · 25 days ago
For the last few years, I’ve been developing CAM/CAE software at my job, sometimes for embedded Linux.

Same experience: the last time I developed software written entirely in C++ was in 2008. Nowadays I only use C++ for DLLs consumed by C#, for two reasons: manually vectorized SIMD for CPU-bound numerical stuff, and consuming C or C++ libraries like GPU APIs, MS Media Foundation, or Eigen.
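
A typical exported function looks something like this (made-up name and signature, just to show the shape of the interop boundary; compile with AVX enabled and call it from C# via DllImport):

    #include <immintrin.h>
    #include <cstddef>

    // Exported with a C signature so the C# side can P/Invoke it directly.
    extern "C" __declspec(dllexport)
    void addVectors(const float* a, const float* b, float* result, size_t length)
    {
        size_t i = 0;
        for (; i + 8 <= length; i += 8)   // manually vectorized AVX loop, 8 floats per iteration
        {
            const __m256 va = _mm256_loadu_ps(a + i);
            const __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(result + i, _mm256_add_ps(va, vb));
        }
        for (; i < length; i++)           // scalar tail for the remaining elements
            result[i] = a[i] + b[i];
    }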

Const-me commented on 2025 Stack Overflow Developer Survey Results   survey.stackoverflow.co/2... · Posted by u/colingw
wing-_-nuts · a month ago
C# as popular and commonly used as java? Hmm either I'm woefully behind the times, most java devs aren't answering SO surveys, or the data is wrong. Then again the fact that C# is used in indie game dev probably gives it a serious boost in the 'evangelical user' category. It is a nice language, I just find java to be far more used in industry.

Also, the editors? I'm sorry I've simply never gotten vsCode when jetbrains and neovim exist, much less N++.

Regardless I think I have to acknowledge that maybe I'm not your average dev. TBH your average dev is probably very happy coding up react widgets in vsCode, and I'm the grouchy greybeard behind the times.

Const-me · a month ago
I believe Java is more popular for enterprise and web apps.

.NET is widely used for videogames (games made with Unity, Godot, Unigine engines, also internal tools and game servers), desktop software, embedded software. Java is rarely used for these things, hence the results of that survey.

Const-me commented on I designed my own fast game streaming video codec – PyroWave   themaister.net/blog/2025/... · Posted by u/Bogdanp
actionfromafar · a month ago
What I wonder is, how do you get the video frames to be compressed from the video card into the encoder?

The only frame capture APIs I know take the image from the GPU to CPU RAM; then you can put it back into the GPU for encoding.

Are there APIs which can sidestep the "load to CPU RAM" part?

Or is it implied, that a game streaming codec has to be implemented with custom GPU drivers?

Const-me · a month ago
> Are there APIs which can sidestep the "load to CPU RAM" part?

On Windows, that API is Desktop Duplication. The API delivers D3D11 textures, usually in BGRA8_UNORM format. When HDR is enabled, you need a slightly different API method which can deliver HDR frames in RGBA16_FLOAT pixel format.
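
The capture loop looks roughly like this (sketch only: device/output setup, error handling, and COM smart pointers are all omitted):

    #include <d3d11.h>
    #include <dxgi1_2.h>

    void captureLoop(ID3D11Device* device, IDXGIOutput1* output)
    {
        IDXGIOutputDuplication* duplication = nullptr;
        output->DuplicateOutput(device, &duplication);

        for (;;)
        {
            DXGI_OUTDUPL_FRAME_INFO frameInfo = {};
            IDXGIResource* resource = nullptr;
            const HRESULT hr = duplication->AcquireNextFrame(100, &frameInfo, &resource);
            if (hr == DXGI_ERROR_WAIT_TIMEOUT)
                continue;       // nothing on the desktop changed within the timeout
            if (FAILED(hr))
                break;

            // The frame stays in VRAM as a D3D11 texture (usually BGRA8_UNORM),
            // ready to be handed to an encoder with no round trip through CPU RAM.
            ID3D11Texture2D* frame = nullptr;
            resource->QueryInterface(__uuidof(ID3D11Texture2D), (void**)&frame);
            /* ... pass the texture to the encoder ... */

            frame->Release();
            resource->Release();
            duplication->ReleaseFrame();
        }
        duplication->Release();
    }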

Const-me commented on I wasted weeks hand optimizing assembly because I benchmarked on random data   vidarholen.net/contents/b... · Posted by u/thunderbong
menaerus · a month ago
I still don't understand if you're really sending byte by byte over TCP/UDP or? Because that would be detrimental for performance.
Const-me · a month ago
For the sending side, a small buffer (which should implement a flush method called after serializing the complete message) indeed helps amortize the costs of system calls. However, a buffer large enough for 1GB messages would waste too much memory. Without such a buffer on the sender side, it’s impossible to prepend every message with its length: you don’t know the length of a serialized message until the entire message is serialized.

With streamed serialization, the receiver doesn’t know when the message will end. This generally means you can’t optimize LEB128 decoding by testing high bits of 10 bytes at once.

For example, let’s say the message is a long sequence of strings, serialized into a sequence of [ length, payload ] pairs, where length is a var.int, payload is an array of UTF-8 bytes of that length, and the message is terminated with a zero-length string.

You can’t implement data parallel LEB128 decoder for the length field in that message by testing multiple bytes at once because it may consume bytes past the end of the message. A decoder for MKV variable integers only needs 1-2 read calls to decode even large numbers, because just the first byte contains the encoded length of the var.int.
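
For concreteness, a streaming writer for the format described above might look like this (sketch; note the zero-length terminator implies the payload strings themselves must be non-empty):

    #include <cstdint>
    #include <ostream>
    #include <string>
    #include <vector>

    // Write an unsigned integer as LEB128: 7 bits per byte, high bit = "more bytes follow".
    static void writeLeb128(std::ostream& out, uint64_t value)
    {
        do {
            uint8_t byte = value & 0x7F;
            value >>= 7;
            if (value != 0)
                byte |= 0x80;
            out.put((char)byte);
        } while (value != 0);
    }

    // Stream a sequence of [ length, UTF-8 payload ] pairs, terminated by a zero length.
    static void writeStrings(std::ostream& out, const std::vector<std::string>& strings)
    {
        for (const std::string& s : strings)
        {
            writeLeb128(out, s.size());
            out.write(s.data(), (std::streamsize)s.size());
        }
        writeLeb128(out, 0);
    }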

Const-me commented on I wasted weeks hand optimizing assembly because I benchmarked on random data   vidarholen.net/contents/b... · Posted by u/thunderbong
menaerus · a month ago
But you never serialize byte by byte over the network. You encode, let's say, 1000 varints and then send them out on the wire. On the other end, you may or may not know how many varints there are to unpack, but since you certainly have to know the length of the packet you can start decoding on a stream of bytes. The largest 64-bit LEB128 number occupies 10 bytes whereas the largest 32-bit one occupies 5 bytes, so we know the upper bound.
Const-me · a month ago
> you never serialize a byte by byte over the network

I sometimes do for the following two use cases.

1. When the protocol delivers real-time data, I serialize byte by byte over the network to minimize latency.

2. When the messages in question are large, like 1GB or more, I don’t want to waste that much memory on both ends of the socket to buffer complete messages like Protobuf does. On the sending side, I want to produce the message, encode it at the same time with a streaming serializer, and stream bytes directly into the socket. Similarly, on the receiving side I want to de-serialize bytes as they come with a streaming de-serializer. Without buffering the complete message on the sending side, the receiving side can’t know the length of the message in advance, because the sender only knows the length after the complete message is serialized.

Const-me commented on I wasted weeks hand optimizing assembly because I benchmarked on random data   vidarholen.net/contents/b... · Posted by u/thunderbong
menaerus · a month ago
> LEB128 is slow and there’s no way around that.

Actually there is - you can exploit the data parallelism through SIMD to replace the logarithmic with near-constant complexity. Classic approach indeed is very slow and unfriendly to the CPU.

Const-me · a month ago
> Actually there is - you can exploit the data parallelism

That doesn’t help much when you’re streaming the bytes, like many parsers or de-serializers do. You have to read bytes from the source stream one by one because each next byte might be the last one of the integer being decoded. You could work around that with a custom buffering adapter, but that’s hard to do correctly, costs performance, and in some cases is even impossible: when the encoded integers are coming in real time from the network, trying to read the next few bytes into a RAM buffer may block.

With MKV variable integers, you typically know the encoded length from just the first byte. Or in very rare cases of huge uint64 numbers, you need 2 bytes for that. Once you know the encoded length, you can read the rest of the encoded bytes with a single read function.

Another thing: unless you’re running on a modern AMD64 CPU which supports the PDEP/PEXT instructions (part of the BMI2 ISA extension), it’s expensive to split/merge the input bits to/from the groups of 7 bits in every encoded byte. MKV variable integers don’t do that; they only need bit scan and byte swap, and both instructions are available on all modern mainstream processors, including ARM, and are very cheap.
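
To illustrate, here’s a sketch of decoding such a variable-size integer, EBML-style as used by MKV (assumes the first byte is non-zero and at least 8 bytes of the stream are available in src):

    #include <cstdint>
    #include <cstddef>

    // Count leading zero bits of a non-zero byte, viewed as 8 bits.
    static int leadingZeros8(uint8_t b)
    {
        return __builtin_clz((unsigned)b) - 24;   // use _BitScanReverse on MSVC
    }

    // Decode one EBML/MKV variable-size integer from src; returns bytes consumed.
    // The first byte alone tells how many more bytes follow, so the payload can be
    // fetched with a single read instead of inspecting every byte like LEB128.
    static size_t decodeVarInt(const uint8_t* src, uint64_t* result)
    {
        const int extraBytes = leadingZeros8(src[0]);               // 0..7
        uint64_t value = src[0] & (0xFFu >> (extraBytes + 1));      // strip the length marker
        for (int i = 1; i <= extraBytes; i++)                       // big-endian payload bytes
            value = (value << 8) | src[i];
        *result = value;
        return (size_t)extraBytes + 1;
    }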
