Readit News
red2awn commented on Nvidia Nemotron 3 Family of Models   research.nvidia.com/labs/... · Posted by u/ewt-nv
red2awn · 4 hours ago
Very interesting release:

* Hybrid MoE: 2-3x faster than pure MoE transformers

* 1M context length

* Trained on NVFP4

* Open source! Pretraining, mid-training, SFT, and RL datasets released (the SFT HF link is a 404...)

* Open model training recipe (coming soon)

Really appreciate Nvidia being the most open lab, but they really should make sure all the links/data are available on day 0.

Also interesting that the model is trained in NVFP4 but the inference weights are FP8.
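
For the unfamiliar, a minimal PyTorch sketch of what "FP8 inference weights" means in practice (my own illustration of per-tensor e4m3 quantization, not Nvidia's actual NVFP4-to-FP8 recipe):

    import torch

    # Per-tensor FP8 (e4m3) quantization of a stand-in weight matrix.
    # 448 is the largest finite value representable in float8_e4m3fn.
    w = torch.randn(4096, 4096, dtype=torch.bfloat16)
    scale = w.abs().max().float() / 448.0
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)  # stored checkpoint dtype
    w_back = w_fp8.to(torch.bfloat16) * scale            # dequantized for compute
    print((w.float() - w_back.float()).abs().max())      # quantization error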

red2awn commented on Ask HN: How can I get better at using AI for programming?    · Posted by u/lemonlime227
Fire-Dragon-DoL · 2 days ago
I find all AI code to be lower quality than that of humans who care about quality. This might be OK; I think the assumption with AI is that we don't need the code to look beautiful, because AI will be the one looking at it.
red2awn · 2 days ago
Opus 4.5 writes the highest-quality code I've seen out of LLMs. There's still some way to go to match programmers who care, but it's much better than most people. I find it works well to let it write the code and then manually polish it afterwards.
red2awn commented on Ask HN: How can I get better at using AI for programming?    · Posted by u/lemonlime227
kidbomb · 2 days ago
Does the same happen if I create an AGENTS.md instead?
red2awn · 2 days ago
Claude Code does not support AGENTS.md; you can symlink it to CLAUDE.md to work around it. Anthropic: pls support!
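
A quick way to set that up from the repo root (the shell one-liner ln -s AGENTS.md CLAUDE.md does the same thing):

    import os

    # Make CLAUDE.md a symlink to AGENTS.md so Claude Code picks up the
    # same instructions file that AGENTS.md-aware tools read.
    if not os.path.lexists("CLAUDE.md"):
        os.symlink("AGENTS.md", "CLAUDE.md")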
red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
whimsicalism · 5 days ago
Makes sense, I think streaming audio->audio inference is a relatively big lift.
red2awn · 5 days ago
Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV-cache prefill behind a websocket server.
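
Roughly the shape of it, as a sketch (the model and cache API here are hypothetical stubs; only the websocket-plus-incremental-prefill pattern is the point):

    import asyncio
    import websockets  # pip install websockets

    class StubOmniModel:
        # Stand-in for a real speech model; only the call pattern matters.
        def new_kv_cache(self):
            return []                         # a real cache would live on-GPU
        def prefill(self, audio_frame, cache):
            cache.append(audio_frame)         # extend the cache incrementally
        async def generate_stream(self, cache):
            yield b"response-audio-bytes"     # pretend this is decoded audio

    model = StubOmniModel()

    async def handle(ws):
        cache = model.new_kv_cache()          # one persistent cache per connection
        async for frame in ws:                # each message = one small audio chunk
            model.prefill(frame, cache)       # prefill while the user is still talking
            if frame == b"<end_of_turn>":     # in practice a VAD decides this
                # Cache is already warm, so first-byte latency is decode only.
                async for out in model.generate_stream(cache):
                    await ws.send(out)

    async def main():
        async with websockets.serve(handle, "localhost", 8765):
            await asyncio.Future()            # serve forever

    asyncio.run(main())
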
red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
AndreSlavescu · 5 days ago
We actually deployed working speech to speech inference that builds on top of vLLM as the backbone. The main thing was to support the "Talker" module, which is currently not supported on the qwen3-omni branch for vLLM.

Check it out here: https://models.hathora.dev/model/qwen3-omni

red2awn · 5 days ago
Nice work. Are you working on streaming input/output?
red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
banjoe · 5 days ago
Wow, crushing 2.5 Flash on every benchmark is huge. Time to move all of my LLM workloads to a local GPU rig.
red2awn · 5 days ago
Why would you use an Omni model for a text-only workload... There's Qwen3-30B-A3B for that.
red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
plipt · 5 days ago
Thanks

Was it obvious to you from the article that it's closed-weight? Trying to understand why I was confused; I hadn't seen the "Flash" designation before.

Also 30B models can beat a semi-recent 235B with just some additional training?

red2awn · 5 days ago
They released a Flash variant alongside the original open-weight release. It is also mentioned in Section 5 of the paper: https://arxiv.org/pdf/2509.17765

For the evals, it's probably just trained on a lot of benchmark-adjacent data compared to the 235B model. A similar thing happened with another model today: https://x.com/NousResearch/status/1998536543565127968 (a 30B model trained specifically to do well at maths gets near-SOTA scores)

red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
coder543 · 5 days ago
Based on things I had read over the past several months, Qwen3-Flash seemed to just be a weird marketing term for the Qwen3-Omni-30B-A3B series, not a different model. If they are not the same, then that is interesting/confusing.
red2awn · 5 days ago
It is an in-house closed-weight model for their own chat platform, mentioned in Section 5 of the original paper: https://arxiv.org/pdf/2509.17765

I've seen it in their online materials too but can't seem to find it now.

red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
plipt · 5 days ago
I don't think the Flash model discussed in the article is 30B

Their benchmark table shows it beating Qwen3-235B-A22B

Does "Flash" in the name of a Qwen model indicate a model-as-a-service and not open weights?

red2awn · 5 days ago
Flash is a closed-weight version of https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct (it is 30B, but with additional training on top of the open-weight release). They deploy the Flash version on Qwen's own chat platform.
red2awn commented on Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model   qwen.ai/blog?id=qwen3-omn... · Posted by u/pretext
sosodev · 5 days ago
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does.

Are there any open weight models that do? Not talking about speech-to-text -> LLM -> text-to-speech, btw; I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-nvidia setup.

red2awn · 5 days ago
None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-Nvidia hardware.

u/red2awn

Karma: 413 · Cake day: November 9, 2016

About: Thomas Ip