Readit News

lunixbochs commented on I/O is no longer the bottleneck? (2022)   stoppels.ch/2022/11/27/io... · Posted by u/benhoyt
eliasdejong · a month ago
> A key feature of this code is that it skips CPU cache when copying

Are those numbers also measured while skipping the CPU cache?

lunixbochs · a month ago
naive c is just a memcpy. non-temporal uses the streaming instructions.
lunixbochs commented on I/O is no longer the bottleneck? (2022)   stoppels.ch/2022/11/27/io... · Posted by u/benhoyt
eliasdejong · a month ago
Increasingly the performance limit for modern CPUs is the amount of data you can feed through a single core: basically memcpy() speed. On most x86 cores the limit is around 6 GB/s and about 20 GB/s for Apple M chips.

When you see advertised numbers like '200 GB/s' that is total memory bandwidth, or all cores combined. For individual cores, the limit will still be around 6 GB/s.

This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.

If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.

The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3
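
To make the zero-copy point concrete, here is a minimal sketch in Python. The layout (a table of three offsets followed by fixed-width fields) is made up for illustration and is not Lite³'s actual format; `encode` and `read_field` are hypothetical helper names. It only shows that reading one field from an offset-indexed buffer touches a few bytes, while the JSON path has to scan the whole document first.

```python
import json
import struct

# Hypothetical layout for this sketch only: a header of three little-endian
# uint32 offsets, followed by three uint64 fields stored at those offsets.
HEADER = struct.Struct("<3I")
FIELD = struct.Struct("<Q")


def encode(values):
    """Pack three integers into the hypothetical offset-table layout."""
    offsets = [HEADER.size + i * FIELD.size for i in range(3)]
    body = b"".join(FIELD.pack(v) for v in values)
    return HEADER.pack(*offsets) + body


def read_field(buf, index):
    """Read one field in place: two small reads, no pass over the rest."""
    view = memoryview(buf)  # wraps the bytes without copying them
    (offset,) = struct.unpack_from("<I", view, index * 4)
    (value,) = FIELD.unpack_from(view, offset)
    return value


values = [10, 20, 30]
binary = encode(values)
text = json.dumps({"a": 10, "b": 20, "c": 30})

# json.loads has to parse the entire document before "c" is available;
# read_field only looks at one offset and one 8-byte field.
assert json.loads(text)["c"] == read_field(binary, 2) == 30
```

With three small fields the difference is invisible; the argument only matters when the buffer is large and most of it is never read.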

lunixbochs · a month ago
your single core numbers seem way too low for peak throughput on one core, unless you stipulate that all cores are active and contending with each other for bandwidth

e.g. dual channel zen 1 showing 25GB/s on a single core https://stackoverflow.com/a/44948720

I wrote some microbenchmarks for single-threaded memcpy

    zen 2 (8-channel DDR4)
    naive c:
      17GB/s
    non-temporal avx:
      35GB/s

    Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
    naive c:
      9GB/s
    non-temporal avx:
      13.5GB/s

    apple silicon tests
    (warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

    m3
    naive c:
      17GB/s cold, 41GB/s warm
    non-temporal neon:
      78GB/s cold+warm

    m3 max 
    naive c:
      25GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    m4 pro
    naive c:
      13.8GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    (I'm not actually sure offhand why apple silicon warm is so much faster than cold: the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers, much larger than cache. x86/linux didn't show any cold/warm difference. my guess is that it's something about kernel page accounting and not related to the cpu)

I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on apple silicon
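
For anyone who wants a quick sanity check on their own machine, here is a rough single-threaded copy benchmark in Python. This is not the C/AVX/NEON code behind the numbers above: it only times a bulk buffer-to-buffer copy through `memoryview`, which CPython typically performs as one plain memory copy, so it is closest to the "naive c" rows. Non-temporal stores are not reachable this way. The function name and buffer sizes are my own choices.

```python
import time


def copy_bandwidth_gbps(size_bytes, repeats=5):
    """Best-of-N throughput of one bulk copy, in GB/s of bytes copied."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    src_view, dst_view = memoryview(src), memoryview(dst)

    # Touch every page of both buffers once so the timed runs do not
    # include first-touch page faults (one possible cold/warm difference).
    src_view[:] = b"\x01" * size_bytes
    dst_view[:] = src_view

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst_view[:] = src_view  # one bulk copy inside the interpreter
        best = min(best, time.perf_counter() - start)
    return size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"{copy_bandwidth_gbps(64 * 1024 * 1024):.1f} GB/s")
```

Use buffers much larger than the last-level cache if the goal is DRAM bandwidth; 64 MiB is only a starting point and may still be partly cache-resident on some systems.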

lunixbochs commented on Python numbers every programmer should know   mkennedy.codes/posts/pyth... · Posted by u/WoodenChair
lunixbochs · a month ago
I'm confused why they repeatedly call a slots class larger than a regular dict class, but don't count the size of the dict
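
A small sketch of the accounting issue, assuming the measurement in question is `sys.getsizeof`: that call is shallow, so for a regular instance it does not include the separate `__dict__` object that holds the attributes. The classes and the `shallow_size` helper below are made up for illustration, and exact byte counts vary by CPython version.

```python
import sys


class Regular:
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3


class Slotted:
    __slots__ = ("a", "b", "c")

    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3


def shallow_size(obj):
    """sys.getsizeof(obj) plus the instance __dict__, if the object has one."""
    size = sys.getsizeof(obj)
    attrs = getattr(obj, "__dict__", None)
    if attrs is not None:
        size += sys.getsizeof(attrs)
    return size


# Prints the bare getsizeof figure next to the figure that also counts
# the attribute dict; the slotted class has no dict to add.
for cls in (Regular, Slotted):
    obj = cls()
    print(cls.__name__, sys.getsizeof(obj), shallow_size(obj))
```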
lunixbochs commented on What makes you senior   terriblesoftware.org/2025... · Posted by u/mooreds
ChicagoDave · 2 months ago
I like the post, but I’d add that being senior is also about having the instinct to take risks. I was once at a client in NY with an ASP.NET code base that used the compile-at-runtime capability (like Java used to). The C# source was being pushed to the web server.

I ran a compile and the code was riddled with errors. So I went to the PM and explained the code needed to compile and I needed a day to clean it up.

I refactored the entire project to compile and deploy that way. After that the development went very fast.

The hilarious part was the three devs who’d gone on vacation came back and thought what I’d done was “wrong”.

But the client said we (consultants) had done in two weeks what they couldn’t do in six months.

That’s what a senior engineer does.

lunixbochs · 2 months ago
I'm not familiar with C# compile at runtime. Are you saying your change was to do an AOT compile locally?
lunixbochs commented on Compressing Text into Images   shkspr.mobi/blog/2024/01/... · Posted by u/edent
lunixbochs · 2 years ago
I did a silly experiment to compress word embeddings with jpeg - to see how it collapses semantically as you decrease the quality.

https://bochs.info/vec2jpg/

This was a very basic experiment. I expect you could perform the DCT more intelligently on the vector dimensions instead of trying to pack the embeddings into pixels, and get higher quality semantic compression.
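
A rough sketch of what "DCT on the vector dimensions" could look like, in plain Python. This is not the code behind the vec2jpg experiment: it applies a 1-D orthonormal DCT directly to one vector, zeroes all but the first `keep` coefficients, and inverts. The embedding values are made up, and whether dropping high-frequency coefficients preserves semantics for real embeddings is untested here.

```python
import math


def dct(vec):
    """Orthonormal DCT-II of a list of floats."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                    for i, x in enumerate(vec))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * c * math.cos(math.pi / n * (i + 0.5) * k)
        out.append(total)
    return out


def compress(vec, keep):
    """Zero all but the first `keep` (lowest-frequency) coefficients."""
    coeffs = dct(vec)
    return idct(coeffs[:keep] + [0.0] * (len(coeffs) - keep))


embedding = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.09, 0.27]  # made-up values
for keep in (8, 4, 2):
    approx = compress(embedding, keep)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding, approx)))
    print(keep, round(err, 4))
```

Keeping the largest-magnitude coefficients instead of the first few is another option worth trying, since embedding dimensions have no natural ordering the way pixels do.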

lunixbochs commented on StableLM: A new open-source language model   stability.ai/blog/stabili... · Posted by u/davidbarker
lhl · 3 years ago
FYI, I'm running lm-eval now w/ the tests Bellard uses (lambada_standard, hellaswag, winogrande, piqa, coqa) on the biggest 7B on a 40GB A100 atm (non-quantized version, requires 31.4GB), so the results will be directly comparable to what various LLaMAs look like: https://bellard.org/ts_server/

(UPDATE: the run took 1:36 to complete, but failed at the end with a TypeError, so I will need to poke and rerun).

I'll place results in my spreadsheet (which also has my text-davinci-003 results): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

lunixbochs commented on Box64 – Linux Userspace x86_64 Emulator Targeted at ARM64 Linux Devices   github.com/ptitSeb/box64... · Posted by u/varbhat
parasti · 3 years ago
Slightly off-topic, but the author also made gl4es, a library that basically allows all kinds of OpenGL apps to run on modern devices. Shameless plug: gl4es is what allowed me to port Neverball to the browser.

https://neverball.github.io

lunixbochs · 3 years ago
> made gl4es

Neverball was working in the original glshim project before ptitseb forked it to gl4es. (Not to discount the significant work he's put in since, including the ES2 backend)

u/lunixbochs

Karma: 1810
Cake day: April 25, 2012
About
@lunixbochs // hn@bochs.info // talonvoice.com - don't hurt your hands!