It's amazing how much can be done, but anything that I think will take an hour ends up taking six, often running into the night.
Also take a peek at what's in Memories (which is separate from the above); consider cleaning it up or disabling entirely.
"Sure, I'll help you stop flirting with OOMs"
"Thought for 27s Yep-..." (this comes out a lot)
"If you still graze OOM at load"
"how far you can push --max-model-len without more OOM drama"
- all this in a prolonged discussion about CUDA and various LLM runners. I've added special user instructions to avoid flowery language, but they get ignored.
EDIT: it also dragged the conversation out for hours. I ended up going with the latest docs, and finally all the CUDA issues in a joint tabbyApi and exllamav2 project cleared up. It just couldn't find a solution and kept proposing whatever people wrote in similar issues. Its reasoning capabilities are, in my eyes, greatly exaggerated.
Gemini Pro 2.5 with the diff-fenced edit format rarely fails. So I don't see this Qwen3 hype, unless I am using the wrong edit format. Can anyone tell me which edit format works better with Qwen3?
It was able to create a sample page, try starting a server, recognise that a leftover server was running, kill it (which forced a prompt for my permission), retry, and find its IP for me to open in the browser.
This isn't a demo anymore. That's actually very useful help for interns/juniors already.