flakiness · 13 days ago
> We are using mistral-common internally for tokenization and want the community to use it to unlock full capacities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI to make it easier for users who are not in the Python ecosystem.

A C++ binary depending on a Python server is a bit sad.

I hope this is a stopgap measure and someone ports it to C++ eventually: https://github.com/mistralai/mistral-common/blob/main/src/mi...
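
For context, the Python side is only a few lines. A rough sketch of what mistral-common tokenization looks like, going by its README (exact class and method names may have shifted between versions):

    # Sketch: tokenize a chat request with mistral-common (per its README)
    from mistral_common.protocol.instruct.messages import UserMessage
    from mistral_common.protocol.instruct.request import ChatCompletionRequest
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    tokenizer = MistralTokenizer.v3()  # pick the tokenizer version matching the model

    tokenized = tokenizer.encode_chat_completion(
        ChatCompletionRequest(messages=[UserMessage(content="Hello, how are you?")])
    )
    print(tokenized.tokens)  # token ids the model actually sees
    print(tokenized.text)    # the rendered prompt string

Porting that (plus the special-token handling behind it) is the part that would need a C++ home.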

the_mitsuhiko · 13 days ago
Isn’t llama.cpp already depending on Python anyways for the templating?
Maxious · 13 days ago
It uses a C++ implementation of Jinja: https://github.com/google/minja
hodgehog11 · 14 days ago
I appreciate Mistral (and others) releasing their weights for free. But given how llama.cpp underpins a lot of the programs that allow users to run open-weight models, it is a little frustrating to see companies that brag about releasing models to the community then leave that community to its own devices to slowly work out how to actually implement those models.

I hear the reason for this is that llama.cpp keeps breaking basic things, so it has become an unreliable partner. It seems this is what Ollama is trying to address by loosening its ties to llama.cpp and working directly with the companies training these models to arrange simultaneous releases (e.g. GPT-OSS).

mattnewton · 14 days ago
There are many different inference libraries, and it's not clear yet which ones a small company like Mistral should back, IMO.

They do release high-quality inference code, i.e. https://github.com/mistralai/mistral-inference

bastawhiz · 14 days ago
There's more to it, though. The inference code you linked to is Python. Unless my software is Python, I have to ship a CPython binary to run the inference code, then wire it up (or port it, if you're feeling spicy).

Ollama brings value by exposing an API (literally over sockets) with many client SDKs. You don't even need the SDKs to use it effectively. If you're writing Node or PHP or Elixir or Clojurescript or whatever else you enjoy, you're probably covered.

It also means that you can swap models trivially, since you're essentially using the same API for each one. You never need to worry about dependency hell or the issues involved in hosting more than one model at a time.
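
As a concrete sketch (Python stdlib only; the default port 11434 and the model names are just examples):

    # Talk to a local Ollama server over its HTTP API
    import json
    import urllib.request

    def chat(model: str, prompt: str) -> str:
        payload = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # single JSON response instead of a stream
        }).encode("utf-8")
        req = urllib.request.Request(
            "http://localhost:11434/api/chat",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]

    # Swapping models is just a string change; the calling code stays the same.
    print(chat("mistral", "Say hi in one word."))
    print(chat("llama3", "Say hi in one word."))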

As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.

refulgentis · 13 days ago
Nah, llama.cpp is stable.

llama.cpp also got GPT-OSS early, like Ollama.

There's a lot of extremely subtle politics going on in the link.

Suffice it to say, as a commercial entity, there's a very clever way to put your thumb on the scale of what works and what doesn't without it being obvious to anyone involved, even the thumb.

hodgehog11 · 13 days ago
Stable for a power user, or stable for everyone? I don't have links on hand, but I could swear there have been recent instances where support for certain models regressed during llama.cpp development. Also, llama.cpp adds features and support on a near-daily basis; how can that be considered LTS?

Don't get me wrong, llama.cpp is an amazing tool. But its development is nowhere near as cautious as something like the Linux kernel's, so there is room for a more stable alternative. Not saying Ollama will do this, but llama.cpp won't be everything to everyone.

mhitza · 13 days ago
llama.cpp still doesn't support gpt-oss tool calling. https://github.com/ggml-org/llama.cpp/pull/15158 (among other similar PRs)

But I also couldn't get vllm, or transformers serve, or ollama (400 response on /v1/chat/completions) working today with gpt-oss. OpenAI's cookbooks aren't really copy-paste instructions. They probably tested on a single platform with preinstalled Python packages which they forgot to mention :))
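
For anyone who wants to poke at a server themselves, this is roughly the kind of request that exercises tool calling on an OpenAI-compatible /v1/chat/completions endpoint (the port and model name are guesses for a local setup; a server without tool-call support tends to 400 or ignore the tools field):

    # Probe tool calling on an OpenAI-compatible endpoint (stdlib only)
    import json
    import urllib.request

    payload = {
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.loads(resp.read())["choices"][0]["message"]
    # Working tool calling should produce a tool_calls entry here,
    # not a plain-text answer or an HTTP error.
    print(json.dumps(message, indent=2))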

baggiponte · 14 days ago
Wow, I never realized how much Mistral was "disconnected" from the ecosystem.