Moshi: A speech-text foundation model for real time dialogue

Tried it (used gibberish email address). It answers immediately/instantly/while you are still talking. But those are just filler sentences (cached answers?). Actual thing that you asked for is answered much later down the line, if it doesn't get stuck in a loop.

swyx · a year ago

Yeah, I tried this demo when it first came out and then again today. Not to be all Reflection 70B again, but it just doesn't seem like the same weights were uploaded as were shown in their original demo from July: https://the-decoder.com/french-ai-lab-kyutai-unveils-convers...

l-m-z · a year ago

Hi swyx, Laurent from Kyutai here. We actually used the online demo at moshi.chat for the live event (the original demo), so same quantization. We have updated the weights on the online version since then to add support for more emotions, but we haven't noticed it being worse. One thing to point out is that it takes time to get used to interacting with the model: what tends to work, how to make it speak. The live event was far from perfect, but we certainly used this experience. I would encourage you to try the same kind of interaction we had at the live event, and you should get similar results (though the model is very unpredictable, so it's hard to be sure; you can see that some parts of the live event definitely didn't work as expected).

huac · a year ago

One guess is that the live demo is quantized to run fast on cheaper GPUs, and that degraded the performance a lot.

imjonse · a year ago

They are too prestigious to try shumering it.

I've been building solutions for real-time voice -> llm -> voice output, and I think the most exciting part of what you're building is the streaming neural audio codec since you're never actually really able to stream STT with whisper.

However from a product point of view I wouldn't necessarily want to pipe that into an LLM and have it reply, I think in a lot of use-cases there needs to be a tool/function calling step before a reply. Down to chat with anyone reading this who is working along these lines!

edit: tincans as mentioned below looks excellent too

editedit: noooo apparently tincans development has ended, there's 10000% space for something in this direction - Chris if you read this please let me pitch you on the product/business use-cases this solves regardless of how good llms get...

malevolent-elk · a year ago

I've been playing around with this workflow too - I'm using a "streaming" setup with Whisper (chunking samples to start transcribing while a user is still talking), which pipes to Mistral 8B as a conversation arbiter to walk through a preset IVR tree which calls tools etc. The LLM isn't responding on its own though, just selecting nodes in the tree with canned TTS outputs.

There's a "pause length" parameter that tries to decide whether a user has finished talking before it passes transcripts to the LLM, nothing fancy. If you have any recs, I'm still working through how to properly handle the audio input and whether a prompting setup can manage the LLM with enough fidelity to scrap the IVR tree. It works decently well, but there's lots of room for improvement.
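A minimal sketch of what a "pause length" check like the one described above might look like, assuming fixed-size audio frames and a simple RMS energy threshold. The class name, defaults, and threshold are all hypothetical and are not taken from the commenter's setup; a real system would more likely use a trained voice activity detector.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class PauseEndpointer:
    """Flags end of utterance after `pause_ms` of continuous silence.

    All defaults are illustrative assumptions, not tuned values.
    """
    pause_ms: int = 700             # silence needed before we call it "done"
    frame_ms: int = 20              # duration of each frame passed to push()
    energy_threshold: float = 0.01  # RMS below this counts as silence
    _silence_ms: int = field(default=0, init=False, repr=False)
    _heard_speech: bool = field(default=False, init=False, repr=False)

    def push(self, frame: Sequence[float]) -> bool:
        """Feed one frame of samples; return True when the utterance has ended."""
        rms = (sum(s * s for s in frame) / max(len(frame), 1)) ** 0.5
        if rms >= self.energy_threshold:
            # Speech frame: remember that the user started talking, reset the timer.
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            # Leading silence before any speech is ignored.
            return False
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.pause_ms:
            # Pause was long enough: report end of turn and reset for the next one.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False
```

The transcript would be handed to the LLM only when `push` returns `True`. The trade-off sits entirely in `pause_ms`: a short value cuts people off mid-thought, a long one adds that much latency to every turn.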

Jonovono · a year ago

Is this a client / server setup? What are you using for handling the streaming of audio? (daily, livekit, etc?)

huac · a year ago

> there needs to be a tool/function calling step before a reply

I built that almost exactly a year ago :) it was good but not fast enough - hence building the joint model.

Reubend · a year ago

Let me offer some feedback, since almost all of the comments here are negative. The latency is very good, almost too good since it seems to interrupt me often. So I think that's a great achievement for an open source model.

However, people here have been spoiled by incredibly good LLMs lately. And the responses that this model gives are nowhere near the high quality of SOTA models today in terms of content. It reminds me more of the 2019 LLMs we saw back in the day.

So I think you've done a "good enough" job on the audio side of things, and further focus should be entirely on the quality of the responses instead.

08d319d7 · a year ago

Wholeheartedly agree. Latency is good, nice tech (Rust! Running at the edge on a consumer grade laptop!). I guess a natural question is: are there options to transplant a “better llm” into moshi without degrading the experience?

aversis_ · a year ago

But tbh "better" is subjective here. Does the new LLM improve user interactions significantly? Seems like people get obsessed with shiny new models without asking if it’s actually adding value.

Kerbonut · a year ago

With flux, they have been able to separate out the unet. I wonder if something similar could be done here so parts of it can be swapped.

dsmurrell · a year ago

Same question here.

ignoramous · a year ago

Moshi is CC-BY. Another similar 7b (speech-text real-time conversational) model that was recently released under Apache v2: https://tincans.ai/slm3 / https://huggingface.co/collections/tincans-ai/gazelle-v02-65...

iandanforth · a year ago

Important distinction is that tincans is not speech to speech. It uses a separate turn/pause detection model and a text to speech final processing step.

johnsutor · a year ago

Lots of recent development in the speech-enabled LM space (see https://github.com/ictnlp/LLaMA-Omni, https://github.com/gpt-omni/mini-omni)

zackangelo · a year ago

Their inference server is written in Rust using huggingface’s Candle crate. One of the Moshi authors is also the primary author of Candle.

We’ve also been building our inference stack on top of Candle, I’m really happy with it.

baggiponte · a year ago

Super interested. Do you have an equivalent of vLLM? Did you have to rewrite batching, paged attention…?

Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.

I’ll need to get paged attention working as well, but I think I can launch without it.
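For readers unfamiliar with the term, here is a toy simulation of the idea behind continuous batching: a finished sequence frees its slot at once and a waiting request joins on the very next decode step, instead of the whole batch draining first. This is only a sketch under simplifying assumptions (one token per step, a known token count standing in for "until EOS"); the names are hypothetical and it does not reflect Candle's or vLLM's actual APIs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Request:
    rid: str
    tokens_left: int  # stand-in for "tokens until EOS"; unknown in a real server


def continuous_batching(requests: Iterable[Request], max_batch: int) -> Dict[str, int]:
    """Simulate decode steps; return the step at which each request finished."""
    waiting = deque(requests)
    running: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while waiting or running:
        # Refill free slots before every step, not only when the batch is empty.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step += 1
        for req in running:
            req.tokens_left -= 1  # one decode step yields one token per sequence
            if req.tokens_left <= 0:
                finished_at[req.rid] = step
        # Evict finished sequences so their slots are free on the next step.
        running = [r for r in running if r.tokens_left > 0]
    return finished_at


if __name__ == "__main__":
    reqs = [Request("a", 2), Request("b", 5), Request("c", 2)]
    print(continuous_batching(reqs, max_batch=2))
```

With two slots, request `c` takes over `a`'s slot while `b` is still decoding, so it does not have to wait for `b` to finish as it would under static batching.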

allanrbo · a year ago

Was looking for a demo of it on YouTube and fell over this hilarious one from a few months ago: https://youtu.be/coroLWOS7II?si=TeVghP_Zi0P9exQh . I’m sure it’s improved since :-)

Zenst · a year ago

Wow, it's so worth watching just for a laugh.

marci · a year ago

I'm sorry.

undefinedblog · a year ago

this video made my day, thanks for posting it

vessenes · a year ago

Interesting. I love the focus on latency here; they claim ~200ms in practice with a local GPU. It's backed by a 7B transformer model, so it's not going to be super smart. If we imagine a 70B model has like 1s latency, then there's probably a systems architecture that's got 1 or 2 intermediate 'levels' of response, something to cue you verbally "The model is talking now," something that's going to give a quick early reaction (7B / Phi-3 sized), and then the big model. Maybe you'd have a reconciliation task for the Phi-3 model: take this actually correct answer, apologize if necessary, and so on.

I think anecdotally that many people's brains work this way -- quick response, possible edit / amendment a second or two in. Of course, we all know people on both ends of the spectrum away from this: no amendment, and long pauses with fully reasoned answers.
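A rough sketch of the tiered idea above, using asyncio: emit an instant cue, then a quick small-model reaction, then the big model's answer with a reconciliation step. The model functions are placeholder stubs with made-up delays, and every name here is hypothetical; nothing in it describes how Moshi itself works.

```python
import asyncio
from typing import Callable, List, Tuple


async def fast_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for a small, low-latency model
    return f"quick take on {prompt!r}"


async def slow_model(prompt: str) -> str:
    await asyncio.sleep(0.05)  # placeholder for a large, slower model
    return f"considered answer to {prompt!r}"


async def respond(prompt: str, emit: Callable[[str, str], None]) -> None:
    # Start the big model first so its latency overlaps with everything else.
    slow_task = asyncio.create_task(slow_model(prompt))
    emit("cue", "mm-hm")  # level 0: instant signal that a reply is coming
    quick = await fast_model(prompt)
    emit("quick", quick)  # level 1: early reaction from the small model
    final = await slow_task
    # Reconciliation: if the big model disagrees, amend what was already said.
    emit("final", final if final == quick else f"Actually, {final}")


def run_demo() -> List[Tuple[str, str]]:
    events: List[Tuple[str, str]] = []
    asyncio.run(respond("capital of France", lambda kind, text: events.append((kind, text))))
    return events


if __name__ == "__main__":
    for kind, text in run_demo():
        print(kind, "->", text)
```

The key design choice is launching the slow task before emitting anything, so the user-perceived latency is that of the fastest tier while the total time stays close to the slowest one.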

smusamashah · a year ago

artsalamander · a year ago