Shout out to everyone from Ollama and the wider community that helped with the reviews, feedback and assistance along the way. It's great to contribute to such a fantastic project.
Today I ran some perplexity benchmarks comparing F16 and Q8_0 for the K/V cache. I used Qwen 2.5 Coder 7B, as I've heard people say Qwen is more sensitive to quantisation than some other models.
Well, it turns out there's barely any increase in perplexity at all - an increase of just 0.0043.
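For anyone wondering how to read a 0.0043 difference: perplexity is just the exponential of the average negative log-likelihood over a test set, so a delta that small is tiny compared with typical absolute perplexity values (usually single digits for models in this class). A minimal sketch of the calculation itself - the numbers below are made up for illustration, not taken from this benchmark:

    package main

    import (
        "fmt"
        "math"
    )

    // perplexity returns exp(-mean(log p)) over per-token probabilities.
    func perplexity(tokenProbs []float64) float64 {
        var sumLog float64
        for _, p := range tokenProbs {
            sumLog += math.Log(p)
        }
        return math.Exp(-sumLog / float64(len(tokenProbs)))
    }

    func main() {
        // Hypothetical per-token probabilities for the same text under two K/V settings.
        f16 := []float64{0.42, 0.61, 0.18, 0.77, 0.35}
        q8 := []float64{0.42, 0.60, 0.18, 0.77, 0.35}
        fmt.Printf("f16:  %.4f\n", perplexity(f16))
        fmt.Printf("q8_0: %.4f\n", perplexity(q8))
    }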
What's the best way to use Ollama with a GUI, just OpenWebUI? Any options as well for mobile platforms like Android (or, I don't even know if we can run LLMs on the phone in the first place).
I have Open WebUI on a Hetzner instance connected to Deep Infra. Works on mobile by turning the web page into an app. I find the web framework that WebUI uses quite bloated/slow, but apart from that it does work reliably. Price at Deep Infra is typically about $0.04 per month even when actively asking lots of questions during programming.
A lot of the UIs, including Open WebUI, have a feature to expose the interface over the LAN with user accounts - that's what I did to use my GPU while still being on my phone. Not entirely sure about native UIs though.
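If you'd rather not go through a web UI at all, anything on the same LAN can also talk directly to the Ollama HTTP API (port 11434 by default) once the server is bound to a LAN-reachable address. A rough sketch, assuming a host at 192.168.1.50 and a model that has already been pulled:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // Assumes the Ollama server was started with OLLAMA_HOST=0.0.0.0 (or similar)
        // so it listens on the LAN, and that "llama3.2" has already been pulled.
        body, _ := json.Marshal(map[string]any{
            "model":  "llama3.2",
            "prompt": "Say hello from the LAN",
            "stream": false,
        })
        resp, err := http.Post("http://192.168.1.50:11434/api/generate",
            "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Response string `json:"response"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        fmt.Println(out.Response)
    }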
Also, I normally use Groq's (with a q) API since it's really cheap with no upfront billing info required - it's a whole order of magnitude cheaper iirc than OpenAI/Claude. They literally have a /openai endpoint if you need compatibility.
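For anyone who hasn't tried it: the compatibility layer really is just a different base URL plus a bearer token, so most OpenAI-style client code ports over with minimal changes. A rough sketch using plain net/http - the model name is a placeholder, check Groq's current model list:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // Groq exposes an OpenAI-compatible surface under /openai/v1, so the main
        // changes from an OpenAI client are the base URL and the API key.
        body, _ := json.Marshal(map[string]any{
            "model": "llama-3.1-8b-instant", // placeholder model name
            "messages": []map[string]string{
                {"role": "user", "content": "One-line summary of K/V cache quantisation?"},
            },
        })
        req, _ := http.NewRequest("POST",
            "https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(body))
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Authorization", "Bearer "+os.Getenv("GROQ_API_KEY"))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Choices []struct {
                Message struct {
                    Content string `json:"content"`
                } `json:"message"`
            } `json:"choices"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        if len(out.Choices) > 0 {
            fmt.Println(out.Choices[0].Message.Content)
        }
    }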
You can look in the direction of Google's Gemma if you need a lightweight open-weights LLM - there was something there that I've since forgotten.
I personally use a mix of Open WebUI, Big AGI, BoltAI and AnythingLLM on the desktop. The mobile space is where things are really lacking at the moment; I just end up browsing to Open WebUI, but that's not ideal. I'd love an iOS-native client that's well integrated into Siri, Shortcuts, Sharing, etc.
As a Windows user who just wanted something bare-bones to play with, I found this[1] small project useful. It does support multi-modal models, which is nice.
[1]: https://github.com/jakobhoeg/nextjs-ollama-llm-ui
As far as open source goes, I'd probably recommend LibreChat. It has connections for Ollama, OpenAI, Anthropic, etc. It lets you set up auth so you can theoretically use it from anywhere (phone, etc.).
Fair warning: it's relatively heavyweight insofar as it has to spin up a number of Docker containers, but it works very well.
https://github.com/danny-avila/LibreChat
There are many; apart from what others mentioned, I am exploring:
AnythingLLM - https://anythingllm.com/. I liked the workspace concept in it: you can group documents into workspaces, and the RAG scope is managed per workspace.
> The journey to integrate K/V context cache quantisation into Ollama took around 5 months.
??
They incorrectly tagged #7926, which is a 2-line change, instead of #6279 where it was implemented. That made me dig a bit deeper, and reading the actual change it seems:
The commit (1) is:
> params := C.llama_context_default_params()
> ...
> params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
> params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
Which has been part of llama.cpp since Dec 7, 2023 (2).
So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.
> It took 5 months, but we got there in the end.
... I guess... yay? The challenges don't seem like they were technical, but I guess, good job getting it across the line in the end?
[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
[2] - https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
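For context on the two lines quoted above: the Go runner fills in llama.cpp's context params via cgo, and kvCacheTypeFromStr just maps a user-facing string onto the corresponding ggml cache type. A standalone sketch of that shape, illustrative only - the real function returns cgo enum values rather than Go constants:

    package main

    import (
        "fmt"
        "strings"
    )

    // Illustrative stand-ins for the ggml cache types the real code gets from cgo.
    type ggmlType int

    const (
        typeF16 ggmlType = iota
        typeQ8_0
        typeQ4_0
    )

    // kvCacheTypeFromStr sketches the string -> cache-type lookup; as in the quoted
    // commit, the caller lowercases the value first. Anything unrecognised falls
    // back to f16, which matches the old always-f16 behaviour.
    func kvCacheTypeFromStr(s string) ggmlType {
        switch s {
        case "q8_0":
            return typeQ8_0
        case "q4_0":
            return typeQ4_0
        default:
            return typeF16
        }
    }

    func main() {
        fmt.Println(kvCacheTypeFromStr(strings.ToLower("Q8_0"))) // 1
        fmt.Println(kvCacheTypeFromStr(""))                      // 0, i.e. f16
    }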
Author describes why it took as long as it did in the post, so I don't think they're trying to be disingenuous. Getting minor changes merged upstream in large projects is difficult for newer concepts since you need adoption and support.
The full release seems to contain more code[1], and the author references the llama.cpp pre-work and its author as well.
This person is also not a core contributor, so this reads as a hobbyist and fan of AI dev who is writing about their work. Nothing to be ashamed of IMO.
[1] - https://github.com/ollama/ollama/compare/v0.4.7...v0.4.8-rc0
> this reads as a hobbyist and fan of AI dev that is writing about their work
Bingo, that's me!
I suspect the OP didn't actually read the post.
1. As you pointed out, it's about getting the feature working, enabled and contributed into Ollama, not in llama.cpp
2. Digging through git commits isn't useful when you work hard to squash commits before merging a PR, there were a _lot_ over the last 5 months.
3. While I'm not a Go dev (and the introduction of cgo partway through threw me a bit), there certainly were technical challenges along the way. I suspect they not only didn't bother to read the post, they also didn't bother to read the PR.
Also, just to clarify - I didn't even share this here, it's just my personal blog of things I try to remember I did when I look back at them years later.
I'm going to be generous here and assume you didn't bother to actually read the post (or even the PR) before writing a snarky, non-constructive comment, but skimming through your HN comment history this appears to be on-brand.
I'll be generous and just say, maybe people should just use llama.cpp and not ollama if they care about having nice things, if merging support for existing features is that difficult.
It seems like it's probably a better choice overall.
That said, I'm sure people worked very hard on this, and it's nice to see it as a part of ollama for the people that use it.
Also:
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
https://news.ycombinator.com/newsguidelines.html
Added to the post: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
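Since the linked post is about quantising the K/V cache, it's worth noting the payoff is memory rather than quality: per-token cache size scales with layers x KV heads x head dim x 2 (K and V), so dropping from f16 to q8_0 roughly halves it. A back-of-the-envelope sketch - the dimensions are typical of a 7B-class model with grouped-query attention, not exact figures for Qwen 2.5 Coder:

    package main

    import "fmt"

    func main() {
        // Illustrative dimensions; check the actual model config for real numbers.
        const (
            layers  = 28
            kvHeads = 4
            headDim = 128
            ctx     = 32768 // context length in tokens
        )

        // K and V are both cached, hence the factor of 2.
        bytesPerToken := func(bytesPerElem float64) float64 {
            return 2 * layers * kvHeads * headDim * bytesPerElem
        }

        f16 := bytesPerToken(2.0) * ctx / (1 << 30)   // f16: 2 bytes per element
        q8 := bytesPerToken(1.0625) * ctx / (1 << 30) // q8_0: ~1 byte plus per-block scales

        fmt.Printf("K/V cache at %d ctx, f16:  %.2f GiB\n", ctx, f16)
        fmt.Printf("K/V cache at %d ctx, q8_0: %.2f GiB\n", ctx, q8)
    }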
It's really convenient because it's just an SSH tunnel, and then you get automatic TLS and it protects your home IP.
With that you can access it from your mobile phone, just gotta require a password to access it.
[0] https://github.com/minwidth0px/gpt-playground and https://github.com/minwidth0px/Webrtc-NAT-Traversal-Proxy-Se...
[1] https://github.com/sinclairzx81/smoke
https://prompt.16x.engineer/
Should work well if you have 64G vRAM to run SOTA models locally.
Does anyone have this?
edit: Ah, it's a Mac app.
(Not sure if it uses ollama though)
Ollama is IMO a model downloader for llama.cpp so you can do roleplay with ease.