When you see advertised numbers like '200 GB/s', that is total memory bandwidth, i.e. all cores combined. For an individual core, the limit will still be around 6 GB/s.
This means that even if you write a perfect parser, you cannot go faster. The limit also applies to (de)serializing formats like JSON and Protobuf, because those formats typically must be fully parsed before a single field can be read.
If, however, you use a zero-copy format, the CPU can skip the data it doesn't care about, so you can 'exceed' the 6 GB/s limit.
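As a toy illustration (a hypothetical fixed-offset layout, not Lite³'s actual encoding): with an offset table at the front of the buffer, reading one field touches only a handful of bytes, whereas a JSON parser has to scan the whole buffer first.

    /* Hypothetical zero-copy layout: a u32 offset table at the start of
     * the buffer, one entry per field, followed by the field payloads.
     * Reading field i touches ~8 bytes no matter how big the buffer is. */
    #include <stdint.h>
    #include <string.h>

    static int32_t read_i32_field(const uint8_t *buf, uint32_t i) {
        uint32_t off;
        memcpy(&off, buf + 4 * (size_t)i, sizeof off);  /* offset table entry */
        int32_t val;
        memcpy(&val, buf + off, sizeof val);            /* the field itself */
        return val;
    }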
The Lite³ serialization format I am working on exploits exactly this, and as a result outperforms simdjson by 120x in some benchmarks: https://github.com/fastserial/lite3
e.g. dual-channel Zen 1 showing 25 GB/s on a single core: https://stackoverflow.com/a/44948720
I wrote some microbenchmarks for single-threaded memcpy:
Zen 2 (8-channel DDR4):

  naive C: 17 GB/s
  non-temporal AVX: 35 GB/s

Xeon D-1541 (2-channel DDR4, my weakest system, ten years old):

  naive C: 9 GB/s
  non-temporal AVX: 13.5 GB/s
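For reference, the two x86 variants looked roughly like this (a simplified sketch, not the exact benchmark code; it assumes 32-byte-aligned buffers, a size that's a multiple of 32, and -mavx):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Plain byte-loop copy; the compiler typically auto-vectorizes this. */
    static void copy_naive(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* AVX copy with non-temporal stores: the stores bypass the cache,
     * which avoids pulling destination lines in (no write-allocate). */
    static void copy_nt_avx(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i += 32) {
            __m256i v = _mm256_load_si256((const __m256i *)(src + i));
            _mm256_stream_si256((__m256i *)(dst + i), v);
        }
        _mm_sfence();  /* order the streaming stores before returning */
    }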
Apple Silicon tests (warm = generate a new source buffer, memset(0) the output buffer, add a memory fence, then run the same copy again):
M3:

  naive C: 17 GB/s cold, 41 GB/s warm
  non-temporal NEON: 78 GB/s cold+warm

M3 Max:

  naive C: 25 GB/s cold, 65 GB/s warm
  non-temporal NEON: 49 GB/s cold, 125 GB/s warm

M4 Pro:

  naive C: 13.8 GB/s cold, 65 GB/s warm
  non-temporal NEON: 49 GB/s cold, 125 GB/s warm
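The NEON variant needs a workaround, since ACLE has no dedicated non-temporal store intrinsic. A simplified sketch (not the exact benchmark code) using Clang's __builtin_nontemporal_store, which on AArch64 can lower to the STNP non-temporal store-pair instruction:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* NEON copy with non-temporal store hints. Assumes 16-byte-aligned
     * buffers and n a multiple of 16. */
    static void copy_nt_neon(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            uint8x16_t v = vld1q_u8(src + i);
            __builtin_nontemporal_store(v, (uint8x16_t *)(dst + i));
        }
    }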
(I'm not actually sure offhand why Apple Silicon warm is so much faster than cold. The source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16 GB src/dst buffers, far larger than any cache. x86/Linux didn't show any cold/warm difference. My guess would be that it's something about kernel page accounting rather than the CPU.)
I really don't see how you can claim either a 6 GB/s single-core limit on x86 or a 20 GB/s limit on Apple Silicon.
Are those numbers also measured while skipping the CPU cache?