It's a balancing game: how slow a token generation speed can you tolerate? Would you rather get an answer quickly, or wait a few seconds (or sometimes minutes) for reasoning?
For quick answers, Gemma 3 12B is still good. GPT-OSS 20B is pretty quick when reasoning is set to low; at that setting it usually doesn't think for more than a sentence. I haven't gotten much use out of Qwen3 4B Thinking (2507), but at least it's fast while reasoning.
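If you want to put numbers on how slow is tolerable, llama.cpp ships a benchmarking tool that reports prompt-processing and token-generation speed. A minimal sketch, assuming the models are already downloaded as GGUF files (the paths here are placeholders):

```
# Report prompt processing (pp) and token generation (tg) tokens/sec for a model.
# The file paths are placeholders - point them at whatever GGUFs you actually have.
llama-bench -m ~/models/gemma-3-12b.gguf

# Run it against a smaller model too, to see the speed side of the trade-off directly.
llama-bench -m ~/models/qwen3-4b-thinking.gguf
```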
Luckily, llama.cpp has come a long way and is now at a point where I can easily recommend it as the open source option instead. My main uses:
* General Q&A
* Programming questions - mostly Python and Go.
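For this kind of day-to-day use, a minimal way to serve a model with llama.cpp is llama-server, which exposes an OpenAI-compatible HTTP API. A rough sketch (the model path and the prompt are placeholders):

```
# -ngl 99 offloads all layers to the GPU (Metal on macOS), -c sets the context size.
# The model path is a placeholder; adjust the port if 8080 is taken.
llama-server -m ~/models/your-model.gguf -ngl 99 -c 8192 --port 8080

# Any OpenAI-style client (or plain curl) can then talk to it:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a Go function that reverses a slice of strings."}]}'
```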
I had forgotten the command, but it's this sysctl, which lets macOS allocate more of its RAM to the GPU for use with LLMs (I allowed maybe 28 GB on my machine):
sudo sysctl iogpu.wired_limit_mb=184320
Source: https://github.com/ggml-org/llama.cpp/discussions/15396
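A couple of notes on that command, as I understand it: the value is in megabytes (184320 MB is 180 GB, so scale it to your machine - roughly 28 GB would be 28672), and the setting doesn't persist across reboots, so it has to be re-applied. A quick sketch:

```
# Check the current GPU wired-memory limit in MB (0 generally means the OS default).
sysctl iogpu.wired_limit_mb

# Allow roughly 28 GB instead of the 180 GB value above (28 * 1024 = 28672 MB).
sudo sysctl iogpu.wired_limit_mb=$((28 * 1024))
```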