This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!
llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
I have no idea how to specify custom layer specs with multi GPU, but that is interesting!
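For custom layer placement across multiple GPUs, I believe the relevant llama-server flags are -ngl (total number of layers to offload), --tensor-split (proportion of the model to place on each GPU) and --main-gpu (which device handles the rest). A rough, untested sketch:
./llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF -ngl 99 --tensor-split 3,1 --main-gpu 0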
On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:
unzip llama-b5332-bin-macos-arm64.zip
cd build/bin
sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:
./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
Edit: sorry, this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl
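For the API side, here is a rough sketch of sending an image to that localhost:8080 server from the command line. This assumes a recent multimodal-enabled build where the OpenAI-compatible /v1/chat/completions endpoint accepts base64 image_url parts; photo.jpg is just a placeholder file:
IMG=$(base64 < photo.jpg | tr -d '\n')
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": [
        {\"type\": \"text\", \"text\": \"Describe this image.\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$IMG\"}}
      ]}]}"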
Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!
Any benefit on a Mac with Apple silicon? Any experiences someone could share?
1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.
For example, the pixtral / Mistral Small 3.1 models use a 2D-RoPE trick that takes less memory than ollama's implementation. The same goes for flash attention (which will be added very soon): it will allow the vision encoder to run faster while using less memory.
2. llama.cpp simply supports more models than ollama. For example, ollama does not support either pixtral or SmolVLM.
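If you want to try those, the same one-liner pattern should work. I'm writing the repo name from memory here, so treat it as a guess and double-check the exact ggml-org conversion on Hugging Face:
llama-server -hf ggml-org/pixtral-12b-GGUF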
Small correction: I'm not just asking it to convert ARM NEON to SIMD; for the function handling q6_K_q8_K, I asked it to reinvent a new approach (without giving it any prior examples). The reason I did that was that it had already failed to write this function 4 times.
And a bit of context here: I was doing this during my Sunday, and the time budget was 2 days to finish.
I wanted to optimize wllama (the wasm wrapper for llama.cpp that I maintain) to run deepseek distill 1.5B faster. Wllama is entirely a weekend project and I can never spend more than 2 consecutive days on it.
Between the 2 choices: (1) take the time to do it myself and then maybe give up, or (2) try prompting an LLM to do it and maybe give up (at worst, it just gives me a hallucinated answer), I chose the second option since I was quite sleepy.
So yeah, it turned out to be a great success in the given context. It just does its job and saved my weekend.
Some of you may ask: why not try ChatGPT or Claude in the first place? Well, the short answer is: my input is too long, and these platforms straight up refuse to give me an answer :)
(See the code inside llama_model_default_params() — on Metal builds it defaults n_gpu_layers high enough to offload everything, which is why -ngl isn't needed there; other backends default to 0 layers offloaded.)