Llama.cpp supports Vulkan. why doesn't Ollama?

So many here are trashing on Ollama, saying it's "just" nice porcelain around llama.cpp and it's not doing anything complicated. Okay. Let's stipulate that.

So where's the non-sketchy, non-for-profit equivalent? Where's the nice frontend for llama.cpp that makes it trivial for anyone who wants to play around with local LLMs without having to know much about their internals? If Ollama isn't doing anything difficult, why isn't llama.cpp as easy to use?

Making local LLMs accessible to the masses is an essential job right now—it's important to normalize owning your data as much as it can be normalized. For all of its faults, Ollama does that, and it does it far better than any alternative. Maybe wait to trash it for being "just" a wrapper until someone actually creates a viable alternative.

chown · 7 months ago

I totally agree with this. I wanted to make it really easy for non-technical users with an app that hid all the complexities. I basically just wanted to embed the engine without making users open their terminal, let alone make them configure. I started with llama.cpp amd almost gave up on the idea before I stumbled upon Ollama, which made the app happen[1]

There are many flaws in Ollama but it makes many things much easier esp. if you don’t want to bother building and configuring. They do take a long time to merge any PRs though. One of my PRs has been waiting for 8 months and there was this another PR about KV cache quantization that took them 6 months to merge.

[1]: https://msty.app

zozbot234 · 7 months ago

> They do take a long time to merge any PRs though.

I guess you have a point there, seeing as after many months of waiting we finally have a comment on this PR from someone with real involvement in Ollama - see https://github.com/ollama/ollama/pull/5059#issuecomment-2628... . Of course this is very welcome news.

smcleod · 7 months ago

That qkv PR was mine! Small world.

washadjeffmad · 7 months ago

>So where's the non-sketchy, non-for-profit equivalent

llama.cpp, kobold.cpp, oobabooga, llmstudio, etc. There are dozens at this point.

And while many chalk the attachment to ollama up to a "skill issue", that's just venting frustration that all something has to do to win the popularity contest is to repackage and market it as an "app".

I prefer first-party tools, I'm comfortable managing a build environment and calling models using pytorch, and ollama doesn't really cover my use cases, so I'm not it's audience. I still recommend it to people who might want the training wheels while they figure out how not-scary local inference actually is.

woadwarrior01 · 7 months ago

> llmstudio

ICYMI, you might want to read their terms of use:

https://lmstudio.ai/terms

evilduck · 7 months ago

> llama.cpp, kobold.cpp, oobabooga

None of these three are remotely as easy to install or use. They could be, but none of them are even trying.

> lmstudio

This is a closed source app with a non-free license from a business not making money. Enshittification is just a matter of when.

Aurornis · 7 months ago

It’s so hard to decipher the complaints about ollama in this comment section. I keep reading comments from people saying they don’t trust it, but then they don’t explain why they don’t trust it and don’t answer any follow up questions.

As someone who doesn’t follow this space, it’s hard to tell if there’s actually something sketchy going on with ollama or if it’s the usual reactionary negativity that happens when a tool comes along and makes someone’s niche hobby easier and more accessible to a broad audience.

bloomingkales · 7 months ago

they don’t explain why they don’t trust it

We need to know a few things:

1) Show me the lines of code that log things and how it handles temp files and storage.

2) No remote calls at all.

3) No telemetry at all.

This is the feature list I would want to begin trusting. I use this stuff, but I also don’t trust it.

traverseda · 7 months ago

>So where's the non-sketchy, non-for-profit equivalent?

Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.

That, coupled with the increasing scale of the internet, make it harder and harder for smaller groups to do these kinds of things. At least until we get some good content addressed distributed storage system.

woadwarrior01 · 7 months ago

> Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.

Cloudflare R2 has unlimited egress, and AFAIK, that's what ollama uses for hosting quantized model weights.

Deleted Comment

buyucu · 7 months ago

supporting vulkan will help ollama reach the masses who don't have dedicated gpus from nvidia.

this is such a low hanging fruit that it's silly how they are acting.

lolinder · 7 months ago

As has been pointed out in this thread in a comment that you replied to (so I know you saw it) [0], Ollama goes to a lot of contortions to support multiple llama.cpp backends. Yes, their solution is a bit of a hack, but it means that the effort to adding a new back end is substantial.

And again, they're doing those contortions to make it easy for people. Making it easy involves trade-offs.

Yes, Ollama has flaws. They could communicate better about why they're ignoring PRs. All I'm saying is let's not pretend they're not doing anything complicated or difficult when no one has been able to recreate what they're doing.

[0] https://news.ycombinator.com/item?id=42886933

bestcoder69 · 7 months ago

Llamafile: https://github.com/Mozilla-Ocho/llamafile

lolinder · 7 months ago

Llamafile is great but solves a slightly different problem very well: how do I easily download and run a single model without having any infrastructure in place first?

Ollama solves the problem of how I run many models without having to deal with many instances of infrastructure.

homebrewer · 7 months ago

It's actually more difficult to use on linux (compared to ollama) because of the weird binfmt contortions you have to go through.

axegon_ · 7 months ago

I think you are missing the point. To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting. People are dissatisfied for two, very valid reasons: ollama gives little to no credit to llama.cpp. The second one is the point of the post: a PR has been open for over 6 months and not a huge PR at that has been completely ignored. Perhaps the ollama maintainers personally don't have use for it so they shrugged it off but this is the equivalent of "it works on my computer". Imagine if all kernel devs used Intel CPUs and ignored every non-intel CPU-related PR. I am not saying that the kernel mailing list is not a large scale version of a countryside pub on a Friday night - it is. But the maintainers do acknowledge the efforts of people making PRs and do a decent job at addressing them. While small, the PR here is not trivial and should have been, at the very least, discussed. Yes, the workstation/server I use for running models uses two Nvidia GPU's. But my desktop computer uses an Intel Arc and in some scenarios, hypothetically, this pr might have been useful.

lolinder · 7 months ago

> To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting.

It's made a lot of progress in that the README [0] now at least has instructions for how to download pre-built releases or docker images, but that requires actually reading the section entitled "Building the Project" to realize that it provides more than just building instructions. That is not accessible to the masses, and it's hard for me to not see that placement and prioritization as an intentional choice to be inaccessible (which is a perfectly valid choice for them!)

And that's aside from the fact that Ollama provides a ton of convenience features that are simply missing, starting with the fact that it looks like with llama.cpp I still have to pick a model at startup time, which means switching models requires SSHing into my server and restarting it.

None of this is meant to disparage llama.cpp: what they're doing is great and they have chosen to not prioritize user convenience as their primary goal. That's a perfectly valid choice. And I'm also not defending Ollama's lack of acknowledgment. I'm responding to a very specific set of ideas that have been prevalent in this thread: that not only does Ollama not give credit, they're not even really doing very much "real work". To me that is patently nonsense—the last mile to package something in a way that is user friendly is often at least as much work, it's just not the kind of work that hackers who hang out on forums like this appreciate.

[0] https://github.com/ggerganov/llama.cpp

portaouflop · 7 months ago

llama.ccp is hard to set up - I develop software for a living and it wasn’t trivial for me. ollama I can give to my non-technical family members and they know how to use it.

As for not merging the PR - why are you entitled to have a PR merged? This attitude of entitlement around contributions is very disheartening as oss maintainer - it’s usually more work to review/merge/maintain a feature etc than to open a PR. Also no one is entitled to comments / discussion or literally one second of my time as an OSS maintainer. This is imo the cancer that is eating open source.

pepijndevos · 7 months ago

ramalama seems to be trying, it's a docker based approach.

airstrike · 7 months ago

Here you go: https://github.com/hecrj/icebreaker

lolinder · 7 months ago

> No pre-built binaries yet! Use cargo to try it out

Not an equivalent yet, sorry.

Ollama needs competition. I’m not sure what drives the people that maintain it but some of their actions imply that there are ulterior motives at play that do not have the benefit of their users in mind.

However such projects require a lot of time and effort and it’s not clear if this project can be forked and kept alive.

Deathmax · 7 months ago

The most recent one of the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, misleading users into thinking they are running the full model but really anything but the 671b alias is one of the distilled models. This has already led to lots of people claiming that they are running R1 locally when they are not.

TeMPOraL · 7 months ago

The whole DeepSeek-R1 situation gets extra confusing because:

- The distilled models are also provided by DeepSeek;

- There's also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.

I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

[0] - https://unsloth.ai/blog/deepseekr1-dynamic

adastra22 · 7 months ago

I'm not sure that's fair, given that the distilled models are almost as good. Do you really think Deepseek's web interface is giving you access to 671b? They're going to be running distilled models there too.

blixt · 7 months ago

LM Studio has been around for a long time and does a lot of similar things but with a more UI-based approach. I used to use it before Ollama, and seems it's still going strong. https://lmstudio.ai/

buyucu · 7 months ago

isn't lm stuido closed source?

7thpower · 7 months ago

Can you please explain why you think they may be operating in bad faith?

diggan · 7 months ago

Not parent, but same feeling.

First I got the feeling because of how they store things on disk and try to get all models rehosted in their own closed library.

Second time I got the feeling is when it's not obvious at all about what their motives are, and that it's a for-profit venture.

Third time is trying to discuss things in their Discord and the moderators there constantly shut down a lot of conversation citing "Misinformation" and rewrites your messages. You can ask a honest question, it gets deleted and you get blocked for a day.

Just today I asked why the R1 models they're shipping that are the distilled ones, doesn't have "distilled" in the name, or even any way of knowing which tag is which model, and got the answer "if you don't like how things are done on Ollama, you can run your own object registry" which doesn't exactly inspire confidence.

Another thing I noticed after a while is that there are bunch of people with zero knowledge of terminals that want to run Ollama, even though Ollama is a project for developers (since you do need to know how to run a terminal). Just making the messaging clearer would help a lot in this regarding, but somehow the Ollama team thinks thats gatekeeping and it's better to teach people basic terminal operations.

justinmayer · 7 months ago

Benefiting users is definitely not Ollama’s first priority, as seen when this pull request was summarily closed: https://github.com/jmorganca/ollama/pull/395

Those README changes only served to provide greater transparency to would-be users.

Ulterior motives, indeed.

prabir · 7 months ago

There is https://cortex.so/ that I’m looking forward too.

adastra22 · 7 months ago

Hey thanks, I didn't know about cortex and this looks perfect.

imtringued · 7 months ago

Ollama doesn't really need competition. Llama.cpp just needs a few usability updates to the gguf format so that you can specify a hugging face repository like you can do in vLLM already.

buyucu · 7 months ago

I totally agree that ollama needs competition. They have been doing very sketchy things lately. I wish llama.cpp had an alternative wrapper client like ollama.

Liquix · 7 months ago

agreed. but what's wrong with Jan? does ollama utilize resources/run models more efficiently under the hood? (sorry for the naivete)