That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
In my experience “oversubscribing” threads to cores (more threads than cores) provides a wall-clock time benefit.
I think one thread per core would work better without preemptive scheduling.
But then we aren’t talking about Unix.
This works fine on Linux, and common approach for trading systems where it’s fine to oversubscribe a bunch of cores for this type of stuff. The cores are mostly busy spinning and doing nothing, so it’s very inefficient in terms of actual work, but great for latency and throughput when you need it.