On SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logic puzzles either.
So presumably, this comes down to...
- training technique or data
- model dimensions (width, depth)
- fewer, larger experts vs. many smaller experts
In practice, the fairest comparison would be to a dense ~8B model, since what matters per token is gpt-oss-120B's ~5B active parameters, not the full 120B. Qwen Coder 30B A3B (3B active) is a good sparse comparison point as well.
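For a rough sense of scale, the active-parameter arithmetic is below. The totals, expert counts, and routing figures are approximate numbers from the public model cards, so treat the sketch as illustrative rather than exact:

```python
# Back-of-the-envelope comparison of MoE "active parameter" budgets.
# Figures are approximate, taken from the public model cards.

models = {
    # name: (total params, params active per token)
    "gpt-oss-120B (128 experts, top-4)": (117e9, 5.1e9),
    "Qwen Coder 30B A3B (128 experts, top-8)": (30.5e9, 3.3e9),
    "dense ~8B (everything active)": (8e9, 8e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / 1e9:.1f}B active of {total / 1e9:.1f}B total "
          f"({active / total:.0%} of weights touched per token)")
```

Per token, both MoE models land in roughly dense-8B compute territory, which is why that's the fair dense baseline.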
Qwen Coder 30B is my daily driver in this configuration and in my experience it's quite capable. It runs at ~80 tok/s on my M3 Max and I'm able to hand it maybe 30-50% of my coding tasks, the most menial ones. I'm also exploring ways to RL its approach to coding so it fits my style a bit better (one possible route is sketched below); it's a very exciting prospect if I manage to figure it out.
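I haven't worked this out yet, but one lightweight route would be preference tuning rather than full-blown RL: mine real sessions for pairs where I rewrote the model's output, then run DPO over them. A minimal sketch with TRL, where the model id, dataset contents, and hyperparameters are all placeholders:

```python
# Sketch: preference-tuning a local coder model with DPO via TRL.
# Everything below is a placeholder, not a working recipe; realistically
# you'd use LoRA/QLoRA on rented GPUs for a 30B model, not a laptop.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"  # adjust to your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Pairs mined from real sessions: the same prompt, my edited version as
# "chosen", the model's original attempt as "rejected".
pairs = Dataset.from_dict({
    "prompt":   ["Write a function that retries an HTTP GET with backoff."],
    "chosen":   ["# style I want: small, typed, no clever tricks\n..."],
    "rejected": ["# style I don't: sprawling class hierarchy\n..."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="qwen-coder-dpo", beta=0.1,
                   per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,  # called `tokenizer` in older TRL versions
)
trainer.train()
```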
The missing link is autocomplete, since Roo only solves the agent part. Continue.dev does a decent job there, but you really want to pair it with a fast, large-context model (so it can fit multiple code sections, your recent changes, and context about the repo while still returning suggestions quickly), and that doesn't seem feasible or enjoyable yet in a fully local setup.
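To make the latency constraint concrete, here's roughly what a single autocomplete request looks like: a fill-in-the-middle (FIM) prompt assembled from the code around the cursor and posted to a local llama.cpp server. The special tokens below are the ones documented for the Qwen coder models; the endpoint and settings assume a default `llama-server` setup:

```python
# Sketch: one FIM-style autocomplete request against a local llama.cpp server.
# Assumes `llama-server` is running on port 8080 with a Qwen coder GGUF loaded.
import requests

prefix = ("def retry_get(url: str, attempts: int = 3):\n"
          "    for i in range(attempts):\n        ")
suffix = "\n    raise RuntimeError(\"all attempts failed\")\n"

# Fill-in-the-middle prompt: the model completes the gap between the two halves.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp's native completion endpoint
    json={
        "prompt": fim_prompt,
        "n_predict": 64,       # keep completions short for low latency
        "temperature": 0.2,
        "stop": ["<|fim_prefix|>", "<|fim_suffix|>"],
    },
    timeout=10,
)
print(resp.json()["content"])
```

Each trigger pays prompt processing over the whole assembled context before the first token arrives, which is why autocomplete stresses prefill speed and context size more than raw generation speed.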