Voxtral – Frontier open source speech understanding models

Voxtral – Frontier open source speech understanding models mistral.ai/news/voxtral...

homarp · 2 months ago

Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

GaggiX · 2 months ago

There is also a Voxtral Small 24B small model available to be downloaded: https://huggingface.co/mistralai/Voxtral-Small-24B-2507

ipsum2 · 2 months ago

24B is crazy expensive for speech transcription. Conspicuously no comparison with Parakeet, a 600M param model thats currently dominating leaderboards (but only for English)

azinman2 · 2 months ago

But it also includes world knowledge, can do tool calls, etc. It’s an omnimodel

qwertox · a month ago

Only the mini is meant for pure transcription. And with the tests I just did on their API, comparing to Whisper large, they are around three times faster, more accurate and cheaper.

24B is, as sibling comment says, an omni model, it can also do function calling.

kamranjon · 2 months ago

Im pretty excited to play around with this. I’ve worked with whisper quite a bit, it’s awesome to have another model in the same class and from Mistral, who tend to be very open. I’m sure unsloth is already working on some GGUF quants - will probably spin it up tomorrow and try it on some audio.

lostmsu · 2 months ago

Does it support realtime transcription? What is the ~latency?

rolisz · 2 months ago

Unlikely. The small model is much larger than whisper (which is already hard to use for realtime)

vivalapomy · a month ago

Won't comment on the 24B model as I see no use for it personally, but regarding purely ASR tasks, I honestly can't see voxtral taking off. For personal usage, I've been running a quant of whisper tiny(for english), as well as whisper small(for spanish, as is my native language), and have never experienced major latency when using for globally available voice commands. Considering my machine runs an Ivy Bridge processor, using CPU inference, the pricing seems unreasonable.

sheerun · 2 months ago

In demo they mention polish prononcuation is pretty bad, spoken as if second language of english-native speaker. I wonder if it's the same for other languages. On the other hand whispering-english is hillariously good, especially different emotions.

Raed667 · 2 months ago

It is insane how good the "French man speaking English" demo is. It captures a lot of subtleties

potlee · a month ago

That’s an actual French man speaking English

lostmsu · 2 months ago

My Whisper v3 Large Turbo is $0.001/min, so their price comparison is not exactly perfect.

ImageXav · 2 months ago

How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.

lostmsu · 2 months ago

Harvesting idle compute. https://borgcloud.org/speech-to-text