To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
25t/s prompt processing
63t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B already has very decent output, describing different images pretty well.
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
I get the same as well; instead, I get this message, no matter which image I upload:
"This is a humorous meme that uses the phrase "one does not get it" in a mocking way. It's a joke about people getting frustrated when they don’t understand the context of a joke or meme."
It is a 4-bit quant, gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as the prompt, or "short description" if I want less verbose output.
Since you are a photographer, I used a picture from your website; gemma 4b produces the following:
"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."
This description is pretty spot on.
The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.
n.b. the image processing is done by a separate model; it basically has to load the image and generate ~1000 tokens
(source: vision was available in llama.cpp before, but it was Very Hard; I've been maintaining an implementation)
(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)
I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!
I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.
Yep, exactly, just looped through each image with the same prompt and stored the results in a SQLite database to search through and maybe present more than a simple WebUI in the future.
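The loop really can be that simple; here's a rough, untested sketch of the idea using llama-mtmd-cli in one-shot mode and the sqlite3 CLI (the --image/-p flags, paths, and table layout are just illustrative; check llama-mtmd-cli --help for the real options):

sqlite3 photos.db 'CREATE TABLE IF NOT EXISTS photos (path TEXT PRIMARY KEY, description TEXT);'
for img in ~/trip/*.jpg; do
  # one-shot caption: logs go to stderr, the model reply to stdout;
  # single quotes are doubled so the INSERT stays valid SQL
  desc=$(./llama.cpp/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
           --image "$img" -p "short description" 2>/dev/null | sed "s/'/''/g")
  sqlite3 photos.db "INSERT OR REPLACE INTO photos VALUES ('$img', '$desc');"
done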
It certainly seemed good enough for my use. I fed it some random images I found online; you can see the sort of metadata it outputs in a static dump here: https://q726kbxun.github.io/llama_cpp_vision/index.html
It's not perfect, by any means, but between the keywords and description text, it's good enough for me to be able to find images in a larger collection.
For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.
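If I'm not mistaken, that's just:

brew install --HEAD llama.cpp

and I believe brew upgrade --fetch-HEAD llama.cpp will then rebuild it against the latest master.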
Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!
OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!
1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.
For example, the pixtral/mistral small 3.1 models have a 2D-RoPE trick that uses less memory than ollama's implementation. Same for flash attention (which will be added very soon): it will allow the vision encoder to run faster while using less memory.
2. llama.cpp simply supports more models than ollama. For example, ollama supports neither pixtral nor smolvlm.
Steps to reproduce:
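The command isn't shown here, but it's presumably something like the following, either with the -hf shortcut or with an explicit --mmproj file (file names are placeholders):

# -hf pulls both the model and its mmproj from Hugging Face
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# or point at local files explicitly (placeholder file names), in which case --mmproj is required
llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf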
Then open http://127.0.0.1:8080/ for the web interface.
Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that multimodal is not supported by the model.
I have used the official ggml-org/gemma-3-4b-it-GGUF quants; I expect the unsloth quants from danielhanchen to be a bit faster.
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
No, none of these things are in the images.
I don't even know how to begin debugging that.
Not sure why it's not working
https://github.com/ggml-org/llama.cpp/discussions/4167
I wonder if it's the encoder that isn't optimized?
want to have a look before I try
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
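Roughly the standard CMake build, if I'm not mistaken (add -DGGML_CUDA=ON for NVIDIA; Metal is enabled by default on Apple silicon):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# the binaries land in build/bin/
./build/bin/llama-mtmd-cli --help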
I made some quants with vision support - literally run:
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for the Metal backend (CUDA still needs it); llama.cpp will auto offload to the GPU by default! -1 means all layers are offloaded to the GPU.
This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!
Similar to how we ended up with the huggingface/tokenizers library for text-only Transformers.
Very nice for something that's self hosted.
If you want to see, here it is:
https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...
Here's an example of the kind of detail it built up for me for one image:
https://imgur.com/a/6jpISbk
It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. Probably will even work for someone that's not me.
On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:
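The command isn't shown here and I don't know exactly what was needed, but the usual culprit for downloaded binaries on macOS is the Gatekeeper quarantine flag, so the fix tends to look something like this (directory name is a placeholder, not necessarily what was actually run):

# guess at the usual Gatekeeper workaround, not necessarily the exact command used here
xattr -dr com.apple.quarantine ./llama-b5332-bin-macos-arm64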
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):
Or start the localhost 8080 web server (with a UI and API) like this:
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
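The commands themselves aren't shown here, but going by the -hf examples elsewhere in this thread they were presumably along these lines:

# interactive terminal: type /image some-image.png in the chat
./llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
# web server with UI and API on http://127.0.0.1:8080/
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF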
Llama-server allowing vision support is definitely super cool - was waiting for it for a while!
Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl
Any benefit on a Mac with Apple silicon? Any experiences someone could share?