Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.
Thank you for the link! Mistral's playground does not have a microphone; it just uploads files, which doesn't demonstrate the speed and accuracy, but the link you shared does.
I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic every time you wanted to speak and, well, not only was the transcription bad, it was so bad that it'd try to interpret the sound of the click as a word.
It was so bad I told several people not to invest in what was back then a national tech darling:
https://en.wikipedia.org/wiki/Lernout_%26_Hauspie
That turned out to be a massive fraud.
But ...
> I tried speaking in 2 languages at once, and it picked it up correctly.
I'm a native French speaker and I tried with a very simple sentence mixing French and English:
"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")
And instead I got this:
"Je prépare un redote, mais pour une carabine, je préfère un ACOG."
"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.
I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.
Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".
Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.
Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.
I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt
If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.
The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio, converted to a spectrogram, fed to a convnet, and passed through a transformer audio encoder; the encoded audio embedding is then passed, with a history of 1 audio embedding per 80 ms, into a text transformer, which outputs a text embedding that is converted to a text token (each text token is thus also worth 80 ms; a special [STREAMING_PAD] token is emitted when there is no word to produce).
There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!
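To make the delayed-stream idea concrete, here's a minimal sketch of the decoding loop as described above; `audio_encoder` and `text_decoder` are hypothetical stand-ins rather than the actual Kyutai or Voxtral APIs, and the delay of 4 frames is just an example value:

```python
# Minimal sketch of delayed-stream STT decoding. One audio frame = 80 ms; the
# text stream runs `delay` frames behind the audio stream, and [STREAMING_PAD]
# is emitted whenever there is no word to produce yet.
STREAMING_PAD = "[STREAMING_PAD]"

def transcribe_stream(audio_frames, audio_encoder, text_decoder, delay=4):
    """Yield one text token per 80 ms audio frame, `delay` frames late."""
    audio_history = []    # one audio embedding per 80 ms frame
    text_history = []     # one text token per frame, pads included
    for i, frame in enumerate(audio_frames):
        # spectrogram -> convnet -> transformer audio encoder, all inside encode()
        audio_history.append(audio_encoder.encode(frame))
        if i < delay:
            continue                       # text stream hasn't started yet
        token = text_decoder.next_token(audio_history, text_history)
        text_history.append(token)
        if token != STREAMING_PAD:         # pads mean "no word ready yet"
            yield token
```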
I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything: jargon, capitalization, all of it. Now I’m looking forward to doing that with 100% local inference!
Hey, I would really appreciate if you would try https://ottex.ai
I'm working on a Wispr/Spokenly competitor. It's free without any paywalled features, and supports local models and a bunch of API providers, including Mistral.
For local models, ottex has Parakeet V3, Whisper, GLM-ASR nano, and Qwen3-ASR (no Voxtral yet, though; looking into it).
btw, you can try the new Voxtral model via the API (the model name to pick is `voxtral-mini-latest:transcribe`). I personally switched to it as my main default fast model; it's really good.
Same here; the voice waveform animates as expected but the model doesn't do anything when I click on the microphone. It just says "Error" in the upper-right corner.
Also tried downloading and running locally, no luck. Same behavior.
But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.
Model is around 7.5 GB; once they get above 4 GB, running them in a browser gets quite difficult, I believe.
Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!
Is it $0.003 per minute of audio uploaded, or per "compute minute"?
For example, fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second", which (at 10-25x realtime) is far cheaper than all the competitors.
Amazon's transcription service is $0.024 per minute, pretty big difference: https://aws.amazon.com/transcribe/pricing/
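To make those pricing schemes comparable, here's a quick back-of-the-envelope conversion from compute-seconds to audio-minutes; the realtime factors are assumptions, and the only sourced numbers are the three rates quoted in this thread:

```python
# Back-of-the-envelope: what "$ per compute-second" means per minute of audio,
# under assumed realtime factors.
fal_per_compute_second = 0.00125      # fal.ai Whisper endpoint, quoted above
per_minute_rates = {"the $0.003/min figure asked about above": 0.003,
                    "Amazon Transcribe": 0.024}

for rtf in (10, 25):                  # assumed realtime factors from the comment
    cost = fal_per_compute_second * (60 / rtf)   # compute seconds per audio minute
    print(f"fal.ai at {rtf}x realtime: ${cost:.5f} per audio minute")
# fal.ai at 10x realtime: $0.00750 per audio minute
# fal.ai at 25x realtime: $0.00300 per audio minute

for name, rate in per_minute_rates.items():
    print(f"{name}: ${rate:.3f} per audio minute")
```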
In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
They don't claim to support Polish, but they do support Russian.
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.
I wonder how much having languages with the same roots (e.g. the romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?
edit: I stand corrected lol. I'll go with "Gaelic" instead.
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"
That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I'm getting transcript in Russian every time. That's upsetting.
Oh no! The model won't transcribe an unsupported language, and incorrectly reverts to one that it was explicitly trained on.
I guess a European version could be created, but for now it's aimed at worldwide distribution.
Try sticking to the supported languages.
The base was likely pretrained on data that included Polish and Ukrainian. You shouldn't be surprised that it doesn't perform great on languages it wasn't explicitly trained on, or that didn't have the highest share of training data.
Cracking non-English or accented/mispronounced English is the white whale of speech-to-text, I think; I don't know about you, but in our day-to-day chats there's a lot of jargon, randomly inserted English words, etc. And when they speak in English it's often what I call expat English, which is what you get when non-native speakers only speak the language with other non-native speakers.
Add poor microphone quality (a laptop mic capturing a presentation given to a room isn't great) and you get a perfect storm of untranscribable presentations and meetings.
All I want from e.g. Teams is a good transcript and, more importantly, a clever summary. Because when you think about it, writing down all the words spoken in a meeting gives you pages and pages of content that nobody would want to read in full.
I'm not sure why, but their multilingual performance has usually been below average. For a French company, their models are not even close to being the best in French, even outdone by the likes of Qwen. I don't think they're focusing on anything but English; the rest is just marketing.
Polish logically should be rendered in Cyrillic, as Cyrillic orthography more closely matches the sounds and consonant structure of Slavic languages like Polish and Russian, although this has never been done, for church reasons. Maybe this is confusing the AI.
Polish works with the Latin alphabet just fine.
"Do kraju tego, gdzie kruszynę chleba podnoszą z ziemi przez uszanowanie dla darów Nieba.... Tęskno mi, Panie..."
"Mimozami jesień się zaczyna, złotawa, krucha i miła. To ty, to ty jesteś ta dziewczyna, która do mnie na ulicę wychodziła."
That's not the case. Polish uses a Latin-like alphabet due to Czech influence and German printers.
I'm so amazed to find out just how close we are to the Star Trek voice computer.
I used to use Dragon Dictation to draft my first novel, had to learn a 'language' to tell the rudimentary engine how to recognize my speech.
And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.
But it can't transcribe any text until I finish recording a file; only then does it start working, so the feedback loop is very slow and batch-like.
And now you've posted this cool solution which streams audio chunks to a model in infinite small pieces, amazing, just amazing.
Now if only I could figure out how to contribute streaming speech-to-text to Handy or something similar, local STT would be a solved problem for me.
[1] https://github.com/cjpais/Handy
https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...
https://github.com/m1el/nemotron-asr.cpp
https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...
Thank you for sharing! Does your implementation allow running the Nemotron model on Vulkan? Like whisper.cpp? I'm curious to try other models, but I don't have Nvidia, so my choices are limited.
I’m curious about this too. On my M1 Max MacBook I use the Handy app on macOS with Parakeet V3 and I get near instant transcription, accuracy slightly less than slower Whisper models, but that drop is immaterial when talking to CLI coding agents, which is where I find the most use for this.
Yeah, I think the multilingual improvements in V3 caused some kind of regression for English - I've noticed large blocks occasionally dropped as well, so reverted to v2 for my usage. Specifically nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3
I didn’t see that but I do get a lot of stutters (words or syllables repeated 5+ times), not sure if it’s a model problem or post processing issue in the Handy app.
Parakeet is really good imo too, and it's just 0.6B so it can actually run on edge devices. 4B is massive, I don't see Voxtral running realtime on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.
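For context on that last point, here's the rough memory math; the 8 GB Orin Nano variant is an assumption, and the 4B parameter count is the figure quoted above:

```python
# Rough memory math for loading a 4B-parameter model in BF16
# (assumption: the 8 GB Orin Nano variant, with RAM shared with the OS/GPU).
params = 4e9
bytes_per_param_bf16 = 2
weights_gib = params * bytes_per_param_bf16 / 1024**3
print(f"~{weights_gib:.1f} GiB for weights alone, before KV cache and activations")
# ~7.5 GiB for weights alone, before KV cache and activations
```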
I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.
https://aclanthology.org/2025.findings-acl.87/
For example, "here it is, voila!" or "turn left on El Camino Real".
Most English speakers would likely understand those without speaking French or Spanish, so it's not necessary to tack on extra languages even if there are loan words.
In general there is a concept called the “curse of multilinguality”:
https://arxiv.org/pdf/1911.02116
It doesn't make sense to have a language-restricted transcription model, because of code switching. People aren't machines; we don't stick to our native languages without fail. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.
Yeah, one example I run into is getting my Perplexity phone assistant to play a song in Spanish. I cannot for the life of me get a model to transcribe:
"Play señorita a mi me gusta su style on spotify" correctly
Everything is a tradeoff, and different use cases require different tradeoffs:
Option A: this model
Option B: faster model, only 1 language
Option C: same size model, only 1 language but higher quality
My point is that option A isn’t always best.
And on the borrowed-words bit, there’s no rule that we cannot add borrowed words into the vocab. But you don’t need the whole language. I know what déjà vu means, but I don’t speak French.
It’s a little bit like asking for everything to be included in the Standard Library. Sure, it sounds nice at first, but now you need to maintain tons of dependencies. And any time you want to do one thing, you bring along the baggage of every other thing.
Languages are similar. They also change over time. So now if you want to release a v2 you need an updated corpus for all languages. Or if you get access to an updated corpus for a small language, it might not merit a new model version since it’s only one out of the 14.
But I actually think that one of the bigger arguments for single-language models is the ability to have more languages. I'm from Sweden, so I would like to have Swedish at an extremely high level, but I wouldn't want all the other small languages at that level, because it would inflate the size. So I actually think having multiple single-language models lets you go both wider and deeper.
Honestly, the inability to correctly transcribe the 4-language mix I use in my everyday life is one of the major blockers for adopting ASR tech in my own tooling. This is coming from someone who literally works in that field.
turns out, outside the US, many people speak more than one language. :)
edit: I should say it *was* a major blocker, because the latest iterations of open-weight models actually work better and better. It's often the UX that isn't designed for these use cases.
A single-language model wouldn't make any sense except for English: there's simply too much English intertwined with any other language nowadays (corporate jargon, brands, tech, etc.)
STT services that have been around for longer, like Azure, Google and Amazon, generally require you to request a specific language, and their quality is a lot higher than models that advertise themselves as LLMs (even though I believe the clouds are also using the same types of models now).
Engineering is about tradeoffs. If the model is being used in an English-only context then tacking on 13 other languages might not be worth the cost.
You are also implicitly choosing worse performance in English by adding extra languages. So you could have a better monolingual model for the same number of weights.
They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.
I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcript 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.
That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.
Uhhh, I cast doubt on multi-language support affecting latency. Model size, maybe, but what is the mechanism for making latency worse? I think of model latency as O(log(model size))… but I am open to being wrong / that being a not-good mental model / educated guess.
Even the model-size effect is modest. There is a lot of machinery that is going to be common for all languages. You don’t multiply model size by 2 when you double the number of supported languages.
Well, for example, the last step is a softmax over all the output logits, and the number of logits equals your vocab size. You need the sum of the exponentiated values of each logit to calculate the denominator, which is O(N).
The bigger impact is the step before that: you need to project the hidden state onto the vocabulary, a matrix on the order of 4096x250000. Bigger vocab = more FLOPs.
If you’re on a GPU things are parallelized, so maybe it’s not quite linear if everything fits nicely. But on a CPU you’re going to struggle more.
This is why the juiciest target when shrinking models is the token embedding table. For example, ALBERT factorized the whole embedding table into two low-rank matrices.
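To put rough numbers on that, here's a quick sketch; the hidden size, vocab size, and rank are illustrative only (not Voxtral's actual dimensions), and the ALBERT-style factorization is applied to the output projection here purely for illustration:

```python
import numpy as np

hidden, vocab, rank = 4096, 250_000, 128   # illustrative sizes, not Voxtral's

# Full output projection: one (hidden x vocab) matrix-vector product per token.
flops_full = 2 * hidden * vocab                         # ~2.0 GFLOPs per token
params_full = hidden * vocab                            # ~1.0B weights

# ALBERT-style low-rank factorization, applied here to the output projection:
# hidden -> rank -> vocab instead of hidden -> vocab.
flops_low_rank = 2 * hidden * rank + 2 * rank * vocab   # ~65 MFLOPs per token
params_low_rank = hidden * rank + rank * vocab          # ~33M weights

print(f"full projection: {flops_full/1e9:.1f} GFLOPs/token, {params_full/1e6:.0f}M params")
print(f"rank-{rank}: {flops_low_rank/1e6:.0f} MFLOPs/token, {params_low_rank/1e6:.0f}M params")

# And the softmax itself is O(vocab): exponentiate and normalize every logit.
logits = np.random.randn(vocab).astype(np.float32)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```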
If encoding more learned languages and grammars and dictionaries makes the model size bigger, it will also increase latency. Try running a 1B model locally and then try to run a 500B model on the same hardware. You'll notice that latency has rather a lot to do with model size.
Wow, Voxtral is amazing. It will be great when someone stitches this up so an LLM starts thinking, researching for you, before you actually finish talking.
Like, create a conversation partner with sub-0.5-second latency. For example, you ask it a multi-part question and, as soon as you finish talking, it gives you the answer to the first part while it looks up the rest of the answer, then stitches it together so that there’s no break.
The 2-3 second latency of existing voice chatbots is a non-starter for most humans.
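If anyone wants to experiment with that idea, here's a rough sketch of the shape it could take; `stream_transcript`, `search`, `llm`, and their methods are all made-up stand-ins, not any real API:

```python
import asyncio

# Rough sketch of "start researching before the user finishes talking".
async def eager_voice_agent(stream_transcript, search, llm):
    """Kick off lookups on partial transcripts; answer as soon as speech ends."""
    lookups = {}                                    # query -> in-flight search task
    partial = ""
    async for event in stream_transcript():         # partial transcripts as they arrive
        partial = event.text
        for query in llm.extract_queries(partial):  # cheap pass over each partial
            if query not in lookups:                # start research immediately
                lookups[query] = asyncio.create_task(search(query))
        if event.is_final:                          # user stopped talking
            break
    evidence = await asyncio.gather(*lookups.values())   # most of it is already done
    return await llm.answer(partial, evidence)           # so little latency is added here
```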
I noticed that with both models voxtral-mini-transcribe-realtime-2602 and voxtral-mini-2602 filler words are ignored. I'd like to be able to count words/sounds, specifically "um" or "uh" for improvement purposes. Any good models that handle that?
Incroyable! Competitive with (if not better than) Deepgram Nova-3, and much better than Assembly and ElevenLabs in basically all cases on our internal streaming benchmark.
The dataset is ~100 8 kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.
Where it does fall down seems to be the latency distribution, but I'm testing against the API. Running it locally will no doubt improve that?
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
And open weight too! So grateful for this.