On SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logic puzzles either.
So presumably, this comes down to...
- training technique or data
- model dimensions (width, depth)
- fewer, larger experts vs. many smaller experts
In practice, the fairest comparison would be to a dense ~8B model, since what matters per token is gpt-oss-120B's ~5B active parameters, not the full 120B. Qwen Coder 30B A3B (3B active) is a good sparse comparison point as well.
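For a rough sense of scale, the active-parameter arithmetic is below. The totals, expert counts, and routing figures are approximate numbers from the public model cards, so treat the sketch as illustrative rather than exact:

```python
# Back-of-the-envelope comparison of MoE "active parameter" budgets.
# Figures are approximate, taken from the public model cards.

models = {
    # name: (total params, params active per token)
    "gpt-oss-120B (128 experts, top-4)": (117e9, 5.1e9),
    "Qwen Coder 30B A3B (128 experts, top-8)": (30.5e9, 3.3e9),
    "dense ~8B (everything active)": (8e9, 8e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / 1e9:.1f}B active of {total / 1e9:.1f}B total "
          f"({active / total:.0%} of weights touched per token)")
```

Per token, both MoE models land in roughly dense-8B compute territory, which is why that's the fair dense baseline.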
Qwen Coder 30B is my daily driver in this configuration and in my experience it's quite capable. It runs at ~80 tok/s on my M3 Max and I'm able to hand it maybe 30-50% of my coding tasks, the most menial ones. I'm also exploring ways to RL its approach to coding so it fits my style a bit better (one possible route is sketched below); it's a very exciting prospect if I manage to figure it out.
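I haven't worked this out yet, but one lightweight route would be preference tuning rather than full-blown RL: mine real sessions for pairs where I rewrote the model's output, then run DPO over them. A minimal sketch with TRL, where the model id, dataset contents, and hyperparameters are all placeholders:

```python
# Sketch: preference-tuning a local coder model with DPO via TRL.
# Everything below is a placeholder, not a working recipe; realistically
# you'd use LoRA/QLoRA on rented GPUs for a 30B model, not a laptop.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"  # adjust to your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Pairs mined from real sessions: the same prompt, my edited version as
# "chosen", the model's original attempt as "rejected".
pairs = Dataset.from_dict({
    "prompt":   ["Write a function that retries an HTTP GET with backoff."],
    "chosen":   ["# style I want: small, typed, no clever tricks\n..."],
    "rejected": ["# style I don't: sprawling class hierarchy\n..."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="qwen-coder-dpo", beta=0.1,
                   per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,  # called `tokenizer` in older TRL versions
)
trainer.train()
```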
The missing link is autocomplete, since Roo only solves the agent part. Continue.dev does a decent job there, but you really want to pair it with a fast, large-context model (so it can fit multiple code sections, your recent changes, and context about the repo while still returning suggestions quickly), and that doesn't seem feasible or enjoyable yet in a fully local setup.
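To make the latency constraint concrete, here's roughly what a single autocomplete request looks like: a fill-in-the-middle (FIM) prompt assembled from the code around the cursor and posted to a local llama.cpp server. The special tokens below are the ones documented for the Qwen coder models; the endpoint and settings assume a default `llama-server` setup:

```python
# Sketch: one FIM-style autocomplete request against a local llama.cpp server.
# Assumes `llama-server` is running on port 8080 with a Qwen coder GGUF loaded.
import requests

prefix = ("def retry_get(url: str, attempts: int = 3):\n"
          "    for i in range(attempts):\n        ")
suffix = "\n    raise RuntimeError(\"all attempts failed\")\n"

# Fill-in-the-middle prompt: the model completes the gap between the two halves.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp's native completion endpoint
    json={
        "prompt": fim_prompt,
        "n_predict": 64,       # keep completions short for low latency
        "temperature": 0.2,
        "stop": ["<|fim_prefix|>", "<|fim_suffix|>"],
    },
    timeout=10,
)
print(resp.json()["content"])
```

Each trigger pays prompt processing over the whole assembled context before the first token arrives, which is why autocomplete stresses prefill speed and context size more than raw generation speed.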