Shout out to everyone from Ollama and the wider community that helped with the reviews, feedback and assistance along the way. It's great to contribute to such a fantastic project.
Today I ran some perplexity benchmarks comparing F16 and Q8_0 for the K/V cache. I used Qwen 2.5 Coder 7B, as I've heard people say Qwen is more sensitive to quantisation than some other models.
Well, it turns out there's barely any increase in perplexity at all - an increase of just 0.0043.
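For anyone wondering how to read a 0.0043 difference: perplexity is just the exponential of the average negative log-likelihood over a test set, so a delta that small is tiny compared with typical absolute perplexity values (usually single digits for models in this class). A minimal sketch of the calculation itself - the numbers below are made up for illustration, not taken from this benchmark:

    package main

    import (
        "fmt"
        "math"
    )

    // perplexity returns exp(-mean(log p)) over per-token probabilities.
    func perplexity(tokenProbs []float64) float64 {
        var sumLog float64
        for _, p := range tokenProbs {
            sumLog += math.Log(p)
        }
        return math.Exp(-sumLog / float64(len(tokenProbs)))
    }

    func main() {
        // Hypothetical per-token probabilities for the same text under two K/V settings.
        f16 := []float64{0.42, 0.61, 0.18, 0.77, 0.35}
        q8 := []float64{0.42, 0.60, 0.18, 0.77, 0.35}
        fmt.Printf("f16:  %.4f\n", perplexity(f16))
        fmt.Printf("q8_0: %.4f\n", perplexity(q8))
    }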
What's the best way to use Ollama with a GUI, just OpenWebUI? Any options as well for mobile platforms like Android (or, I don't even know if we can run LLMs on the phone in the first place).
I have Open WebUI on a Hetzner instance connected to Deep Infra. Works on mobile by turning the web page into an app. I find the web framework that WebUI uses quite bloated/slow, but apart from that it does work reliably. Price at Deep Infra is typically about $0.04 per month even when actively asking lots of questions during programming.
A lot of the UIs, including Open WebUI, have a feature to expose the interface over the LAN with user accounts - that's what I did to use my GPU while still being on my phone. Not entirely sure about native UIs though.
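If you'd rather not go through a web UI at all, anything on the same LAN can also talk directly to the Ollama HTTP API (port 11434 by default) once the server is bound to a LAN-reachable address. A rough sketch, assuming a host at 192.168.1.50 and a model that has already been pulled:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // Assumes the Ollama server was started with OLLAMA_HOST=0.0.0.0 (or similar)
        // so it listens on the LAN, and that "llama3.2" has already been pulled.
        body, _ := json.Marshal(map[string]any{
            "model":  "llama3.2",
            "prompt": "Say hello from the LAN",
            "stream": false,
        })
        resp, err := http.Post("http://192.168.1.50:11434/api/generate",
            "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Response string `json:"response"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        fmt.Println(out.Response)
    }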
Also, I normally use Groq's (with a q) API since it's really cheap with no upfront billing info required - it's a whole order of magnitude cheaper iirc than OpenAI/Claude. They literally have a /openai endpoint if you need compatibility.
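For anyone who hasn't tried it: the compatibility layer really is just a different base URL plus a bearer token, so most OpenAI-style client code ports over with minimal changes. A rough sketch using plain net/http - the model name is a placeholder, check Groq's current model list:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // Groq exposes an OpenAI-compatible surface under /openai/v1, so the main
        // changes from an OpenAI client are the base URL and the API key.
        body, _ := json.Marshal(map[string]any{
            "model": "llama-3.1-8b-instant", // placeholder model name
            "messages": []map[string]string{
                {"role": "user", "content": "One-line summary of K/V cache quantisation?"},
            },
        })
        req, _ := http.NewRequest("POST",
            "https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(body))
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Authorization", "Bearer "+os.Getenv("GROQ_API_KEY"))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Choices []struct {
                Message struct {
                    Content string `json:"content"`
                } `json:"message"`
            } `json:"choices"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        if len(out.Choices) > 0 {
            fmt.Println(out.Choices[0].Message.Content)
        }
    }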
You can look in the direction of Google's Gemma if you need a lightweight open-weights LLM - there was something there that I've since forgotten.
I personally use a mix of Open WebUI, Big AGI, BoltAI and AnythingLLM on the desktop. The mobile space is where things are really lacking at the moment; I just end up browsing to Open WebUI, but that's not ideal. I'd love an iOS-native client that's well integrated into Siri, Shortcuts, Sharing, etc.
As a Windows user who just wanted something bare-bones to play with, I found this[1] small project useful. It does support multi-modal models, which is nice.
[1]: https://github.com/jakobhoeg/nextjs-ollama-llm-ui
As far as open source goes, I'd probably recommend LibreChat. It has connections for Ollama, OpenAI, Anthropic, etc. It lets you set up auth so you can theoretically use it from anywhere (phone, etc.).
Fair warning: it's relatively heavyweight insofar as it has to spin up a number of Docker containers, but it works very well.
https://github.com/danny-avila/LibreChat
There are many; apart from what others mentioned, I am exploring:
AnythingLLM - https://anythingllm.com/. I liked the workspace concept in it: you can group documents into workspaces, and the RAG scope is managed per workspace.
> The journey to integrate K/V context cache quantisation into Ollama took around 5 months.
??
They incorrectly tagged #7926, which is a 2-line change, instead of #6279 where it was implemented. That made me dig a bit deeper, and reading the actual change it seems:
The commit (1) is:
> params := C.llama_context_default_params()
> ...
> params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
> params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
Which has been part of llama.cpp since Dec 7, 2023 (2).
So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.
> It took 5 months, but we got there in the end.
... I guess... yay? The challenges don't seem like they were technical, but I guess, good job getting it across the line in the end?
[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
[2] - https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
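For context on the two lines quoted above: the Go runner fills in llama.cpp's context params via cgo, and kvCacheTypeFromStr just maps a user-facing string onto the corresponding ggml cache type. A standalone sketch of that shape, illustrative only - the real function returns cgo enum values rather than Go constants:

    package main

    import (
        "fmt"
        "strings"
    )

    // Illustrative stand-ins for the ggml cache types the real code gets from cgo.
    type ggmlType int

    const (
        typeF16 ggmlType = iota
        typeQ8_0
        typeQ4_0
    )

    // kvCacheTypeFromStr sketches the string -> cache-type lookup; as in the quoted
    // commit, the caller lowercases the value first. Anything unrecognised falls
    // back to f16, which matches the old always-f16 behaviour.
    func kvCacheTypeFromStr(s string) ggmlType {
        switch s {
        case "q8_0":
            return typeQ8_0
        case "q4_0":
            return typeQ4_0
        default:
            return typeF16
        }
    }

    func main() {
        fmt.Println(kvCacheTypeFromStr(strings.ToLower("Q8_0"))) // 1
        fmt.Println(kvCacheTypeFromStr(""))                      // 0, i.e. f16
    }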
Author describes why it took as long as it did in the post, so I don't think they're trying to be disingenuous. Getting minor changes merged upstream in large projects is difficult for newer concepts since you need adoption and support.
The full release seems to contain more code[1], and the author references the llama.cpp pre-work and its author as well.
This person is also not a core contributor, so this reads as a hobbyist and fan of AI dev who is writing about their work. Nothing to be ashamed of IMO.
[1] - https://github.com/ollama/ollama/compare/v0.4.7...v0.4.8-rc0
> this reads as a hobbyist and fan of AI dev that is writing about their work
Bingo, that's me!
I suspect the OP didn't actually read the post.
1. As you pointed out, it's about getting the feature working, enabled and contributed into Ollama, not in llama.cpp
2. Digging through git commits isn't useful when you work hard to squash commits before merging a PR, there were a _lot_ over the last 5 months.
3. While I'm not a Go dev (and the introduction of cgo partway through threw me a bit), there certainly were technical challenges along the way. I suspect they not only didn't bother to read the post, they also didn't bother to read the PR.
Also, just to clarify - I didn't even share this here, it's just my personal blog of things I try to remember I did when I look back at them years later.
I'm going to be generous here and assume you didn't bother to actually read the post (or even the PR) before writing a snarky, non-constructive comment, but skimming through your HN comment history this appears to be on-brand.
I'll be generous and just say, maybe people should just use llama.cpp and not ollama if they care about having nice things, if merging support for existing features is that difficult.
It seems like it's probably a better choice overall.
That said, I'm sure people worked very hard on this, and it's nice to see it as a part of ollama for the people that use it.
Also:
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
https://news.ycombinator.com/newsguidelines.html
Added to the post: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
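Since the linked post is about quantising the K/V cache, it's worth noting the payoff is memory rather than quality: per-token cache size scales with layers x KV heads x head dim x 2 (K and V), so dropping from f16 to q8_0 roughly halves it. A back-of-the-envelope sketch - the dimensions are typical of a 7B-class model with grouped-query attention, not exact figures for Qwen 2.5 Coder:

    package main

    import "fmt"

    func main() {
        // Illustrative dimensions; check the actual model config for real numbers.
        const (
            layers  = 28
            kvHeads = 4
            headDim = 128
            ctx     = 32768 // context length in tokens
        )

        // K and V are both cached, hence the factor of 2.
        bytesPerToken := func(bytesPerElem float64) float64 {
            return 2 * layers * kvHeads * headDim * bytesPerElem
        }

        f16 := bytesPerToken(2.0) * ctx / (1 << 30)   // f16: 2 bytes per element
        q8 := bytesPerToken(1.0625) * ctx / (1 << 30) // q8_0: ~1 byte plus per-block scales

        fmt.Printf("K/V cache at %d ctx, f16:  %.2f GiB\n", ctx, f16)
        fmt.Printf("K/V cache at %d ctx, q8_0: %.2f GiB\n", ctx, q8)
    }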
It's really convenient because it's just an SSH tunnel, and then you get automatic TLS and it protects your home IP.
With that you can access it from your mobile phone, just gotta require a password to access it.
[0] https://github.com/minwidth0px/gpt-playground and https://github.com/minwidth0px/Webrtc-NAT-Traversal-Proxy-Se...
[1] https://github.com/sinclairzx81/smoke
https://prompt.16x.engineer/
Should work well if you have 64G vRAM to run SOTA models locally.
Does anyone have this?
edit: Ah, it's a Mac app.
(Not sure if it uses ollama though)
Ollama is IMO a model downloader for llama.cpp so you can do roleplay with ease.