dust42 · 8 months ago
To add some numbers: on an M1 MBP with 64GB, using ggml-org/gemma-3-4b-it-GGUF I get

  25t/s prompt processing 
  63t/s token generation
Overall processing time per image is ~15 seconds, regardless of image size. The small 4B already gives very decent output, describing different images pretty well.

Steps to reproduce:

  git clone https://github.com/ggml-org/llama.cpp.git
  cmake -B build
  cmake --build build --config Release -j 12 --clean-first
  # download model and mmproj files...
  build/bin/llama-server \
    --model gemma-3-4b-it-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the --mmproj switch; otherwise the web interface reports that multimodal is not supported by the model.
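
With -hf, llama-server pulls the model and the matching mmproj for you, so the invocation collapses to something like this (same repo as above; exact behaviour may vary by build):

  build/bin/llama-server -hf ggml-org/gemma-3-4b-it-GGUF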

I have used the official ggml-org/gemma-3-4b-it-GGUF quants; I expect the unsloth quants from danielhanchen to be a bit faster.

matja · 8 months ago
For every image I try, I get the same response:

> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.

No, none of these things are in the images.

I don't even know how to begin debugging that.

clueless · 8 months ago
I get the same behaviour as well, except with this message, no matter which image I upload: "This is a humorous meme that uses the phrase "one does not get it" in a mocking way. It's a joke about people getting frustrated when they don’t understand the context of a joke or meme."

Not sure why it's not working

exe34 · 8 months ago
Means it can't see the actual image. It's not loading for some reason.
brrrrrm · 8 months ago
Hmm, I'm getting the same results, but I see that on an M1 with a 7B model we should expect ~10x faster prompt processing:

https://github.com/ggml-org/llama.cpp/discussions/4167

I wonder if it's the encoder that isn't optimized?

zamadatix · 8 months ago
Are those numbers for the 4/8 bit quants or the full fp16?
dust42 · 8 months ago
It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.

Since you are a photographer, I used a picture from your website; gemma 4b produces the following:

"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."

This description is pretty spot on.

The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix's website.

refulgentis · 8 months ago
N.B. the image processing is done by a separate model, which basically has to load the image and generate ~1000 tokens.

(source: vision was available in llama.cpp before, but it was Very Hard; I've been maintaining an implementation)

(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)

astrodude · 8 months ago
do you have any example images it generated based on your prompts?

want to have a look before I try

geoffpado · 8 months ago
To be clear, this model isn't generating images, it's describing images that are sent to it.
danielhanchen · 8 months ago
It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

  ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
  ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
  ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
  ./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is not needed anymore on the Metal backend (llama.cpp will auto-offload to the GPU by default); CUDA still needs it. -1 means all layers are offloaded to the GPU.
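
If you'd rather script it than chat interactively, the same CLI should also take an image and prompt directly on the command line, something like this (the --image/-p flags are assumed from the older llava-style CLIs; check ./llama.cpp/llama-mtmd-cli --help):

  ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99 \
    --image photo.jpg -p "Describe this image."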

danielhanchen · 8 months ago
If it helps, I updated https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-t... to show you can use llama-mtmd-cli directly - it should work for Mistral Small as well

distalx · 8 months ago
Is there a simple GUI available for running LLaMA on my desktop that I can access from my laptop?
thenameless7741 · 8 months ago
If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`
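
For example (formula name as it appears on Homebrew; model reference borrowed from the commands above):

  brew install llama.cpp
  llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL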
danielhanchen · 8 months ago
Oh even better!!

danielhanchen · 8 months ago
Ok it's actually better to use -ngl 99 and not -ngl -1. -1 might or might not work!
raffraffraff · 8 months ago
I can't see the letters "ngl" anymore without wanting to punch something.
simlevesque · 8 months ago
That's your problem. Hope you do something about that pent up aggressivity.
danielhanchen · 8 months ago
Oh it's shorthand for number of layers to offload to the GPU for faster inference :) but yes it's probs not the best abbreviation.

ngxson · 8 months ago
We also support the SmolVLM series, which delivers light-speed responses thanks to its mini size!

This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

a_e_k · 8 months ago
I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!
thatspartan · 8 months ago
Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.
moffkalast · 8 months ago
Ok but what's the quality of the high speed response? Can the sub-2.2B ones output a coherent sentence?
simonw · 8 months ago
This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...
scribu · 8 months ago
It’s interesting that they decided to move all of the architecture-specific image-to-embedding preprocessing into a separate library.

Similar to how we ended up with the huggingface/tokenizers library for text-only Transformers.

banana_giraffe · 8 months ago
I used this to create keywords and descriptions for a bunch of photos from a recent trip, using Gemma 3 4B. It works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self hosted.

accrual · 8 months ago
That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?
banana_giraffe · 8 months ago
Yep, exactly: I just looped through each image with the same prompt and stored the results in a SQLite database, to search through and maybe present in something more than a simple web UI in the future.

If you want to see, here it is:

https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...

Here's an example of the kind of detail it built up for me for one image:

https://imgur.com/a/6jpISbk

It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. Probably will even work for someone that's not me.
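
If it helps, here's the rough shape of that loop as a minimal shell sketch (not the gist's actual code; the model, prompt, and --image/-p flags are my assumptions):

  #!/usr/bin/env bash
  # One row per photo: file path plus the model's description.
  sqlite3 photos.db 'CREATE TABLE IF NOT EXISTS captions (path TEXT PRIMARY KEY, description TEXT);'

  for img in photos/*.jpg; do
    # One-shot description of a single image via the vision CLI.
    desc=$(./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99 \
             --image "$img" -p "Describe this photo and list a few keywords." 2>/dev/null)
    # Quick-and-dirty single-quote escaping, then store keyed by file path.
    safe=${desc//\'/\'\'}
    sqlite3 photos.db "INSERT OR REPLACE INTO captions VALUES ('$img', '$safe');"
  done

Reloading the model for every image is slow; keeping llama-server running and hitting its API per image would avoid that.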

buyucu · 8 months ago
Is Gemma 4B good enough for this? I was playing with the larger versions of Gemma because I didn't think 4B would be any good.
banana_giraffe · 8 months ago
It certainly seemed good enough for my use. I fed it some random images I found online; you can see the sort of metadata it outputs in a static dump here:

https://q726kbxun.github.io/llama_cpp_vision/index.html

It's not perfect, by any means, but between the keywords and description text, it's good enough for me to be able to find images in a larger collection.

simonw · 8 months ago
llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

  unzip llama-b5332-bin-macos-arm64.zip
  cd build/bin
  sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):

  ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:

  ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/

ngxson · 8 months ago
For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!

danielhanchen · 8 months ago
I'm also extremely pleased with convert_hf_to_gguf.py --mmproj - it makes quant making much simpler for any vision model!
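
For reference, the flow is roughly something like this (paths illustrative, and the exact flag combination is an assumption; check the script's --help):

  # Convert the text model weights to GGUF
  python llama.cpp/convert_hf_to_gguf.py path/to/gemma-3-4b-it --outfile gemma-3-4b-it-f16.gguf
  # Produce the vision projector (mmproj) GGUF from the same checkpoint
  python llama.cpp/convert_hf_to_gguf.py path/to/gemma-3-4b-it --mmproj --outfile mmproj-model-f16.gguf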

llama-server gaining vision support is definitely super cool - I was waiting for it for a while!

ngxson · 8 months ago
And btw, -ngl is automatically set to the max value now, so you don't need -ngl 99 anymore!

Edit: sorry, this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl.
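
So on CUDA a vision run still looks something like this (model reference borrowed from the examples above):

  llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99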

danielhanchen · 8 months ago
OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!
thenthenthen · 8 months ago
What has changed, in layman's terms? I tried llama.cpp a few months ago and it could already do image description etc.
nico · 8 months ago
How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

ngxson · 8 months ago
Two things:

1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.

For example, the pixtral/Mistral Small 3.1 models use a 2D-RoPE trick that takes less memory than ollama's implementation. Same for flash attention (which will be added very soon): it will let the vision encoder run faster while using less memory.

2. llama.cpp simply supports more models than ollama. For example, ollama supports neither pixtral nor SmolVLM.

nolist_policy · 8 months ago
On the other hand, ollama supports iSWA for Gemma 3 while llama.cpp doesn't; iSWA reduces the KV cache size to 1/6.
roger_ · 8 months ago
Won’t the changes eventually be added to ollama? I thought it was based on llama.cpp
danielhanchen · 8 months ago
By the way - fantastic work again on llama.cpp vision support - keep it up!!