One thing I found out is that getting calibrated accuracy beyond 0.1% is hard and expensive despite having all that precision.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
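If I understood the paper right, the core trick is computing two attention maps and subtracting one from the other to cancel out "noise" attention mass. A rough single-head sketch just to illustrate the idea (the real thing is multi-head with a learnable lambda plus group norm):

```python
import torch
import torch.nn.functional as F

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    # Two independent QK projections give two attention maps;
    # subtracting the second (scaled by lam) suppresses attention
    # that both maps assign to irrelevant context.
    d = Wq1.shape[-1]
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)

# toy usage with random projections (seq_len=8, d_model=64, d_head=32)
x = torch.randn(8, 64)
W = lambda: torch.randn(64, 32) / 8
out = diff_attention(x, W(), W(), W(), W(), W())  # -> shape (8, 32)
```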
There's also NoPE (no positional embeddings). I think SmolLM3 "uses NoPE", i.e. skips positional encoding entirely, in every fourth layer.
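Just to make the layout concrete, here's a tiny sketch of what "NoPE every fourth layer" would mean; the layer count and indexing convention are my assumptions, not SmolLM3's actual config:

```python
def uses_rope(layer_idx: int, nope_every: int = 4) -> bool:
    # Skip positional encoding (NoPE) on every `nope_every`-th layer,
    # apply RoPE on all the others.
    return (layer_idx + 1) % nope_every != 0

# For a hypothetical 36-layer model, these layers would get no positional
# encoding at all; the rest use RoPE as usual.
nope_layers = [i for i in range(36) if not uses_rope(i)]
print(nope_layers)  # [3, 7, 11, 15, 19, 23, 27, 31, 35]
```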
I understand that you can get highly power-efficient XORs, for example. But if we go down this path, would they help with a matrix multiply? Or the bias term of an FFN? Would there be any improvement (i.e. is there anything to offload) in regular business logic? Should I think of it as a more efficient but highly limited DSP? Or a fixed-function accelerator replacement (e.g. "we want to encrypt this segment of memory")?
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword tokenizer it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
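For what it's worth, this is easy to check. A quick sketch, assuming the SmolLM3 tokenizer is available on the Hub under the id below (swap in whichever tokenizer you actually care about):

```python
from transformers import AutoTokenizer

# Assumed model id; any subword tokenizer will show the same effect.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
ids = tok.encode("WaffleHouse", add_special_tokens=False)
print(len(ids), tok.convert_ids_to_tokens(ids))   # a few subword tokens
print(len("WaffleHouse"))                         # 11 at the character level
```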
So while we could get more intelligence out of the model by shifting parameters from the embedding table into "knowledge" parameters, the smaller vocabulary means the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass.
This blog post explains it well. https://www.adamcasson.com/posts/transformer-flops
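The back-of-the-envelope version: embedding lookups are roughly free, each non-embedding weight costs about 2 FLOPs per token in the forward pass, and there's an attention term that grows with context length. A rough sketch (my own simplification, not the post's exact formula):

```python
def forward_flops_per_token(n_params: int, n_vocab: int, d_model: int,
                            n_layers: int, ctx_len: int) -> int:
    # The embedding table is a lookup, so exclude it from the matmul cost.
    non_embed = n_params - n_vocab * d_model
    # ~2 FLOPs (multiply + add) per non-embedding weight per token,
    # plus the QK^T / AV attention-score work over the context.
    return 2 * non_embed + 2 * n_layers * ctx_len * d_model

# Made-up numbers purely for illustration, not any model's real config:
print(forward_flops_per_token(3_000_000_000, 128_000, 2048, 36, 4096))
```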
So all this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.