One feature of Whisper I think people underuse is the ability to prompt the model to influence the output tokens. You can use it to correct spelling and steer context-dependent words. Some examples from my terminal history:
./main -m models/ggml-small.bin -f alice.wav --prompt "Audiobook reading by a British woman:"
./main -m models/ggml-small.bin -f output.wav --prompt "Research talk by Junyao, Harbin Institute of Technology, Shenzhen, research engineer at MaiMemo"
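The same knob is exposed in the openai-whisper Python package as initial_prompt, as far as I know; a minimal sketch of the first command above:

    import whisper

    # Load the small multilingual model (same size class as ggml-small.bin).
    model = whisper.load_model("small")

    # initial_prompt plays the same role as whisper.cpp's --prompt: it is
    # tokenized and fed to the decoder ahead of the first 30-second window,
    # nudging spelling and style without changing the model itself.
    result = model.transcribe(
        "alice.wav",
        initial_prompt="Audiobook reading by a British woman:",
    )
    print(result["text"])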
It also works across languages. You can use it to influence the transcription to produce traditional or simplified Chinese characters, for instance.
Although I seem to have trouble getting the context to persist across hundreds of tokens. Tokens that are corrected may revert to the model's underlying choices if they aren't repeated enough.
“prompt” is arguably a misnomer. In other implementations, it is correctly called “initial-prompt”, and even the whisper.cpp help describes it as an initial prompt.
It only affects the first 30-second window of the transcription, as far as I've been able to tell. If the word in question appears in that window, it will influence the next window, and so on... but as soon as it doesn't appear in one 30-second window, it's effectively gone, from what I've seen.
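Roughly, my mental model is the sketch below; decode_window() and tokenize() are stand-ins, not actual whisper.cpp code. The prompt is injected once up front, and each later window is conditioned on a rolling buffer of previously generated text, so the prompt falls out of context unless its words keep reappearing.

    # Stand-in sketch of Whisper's long-form conditioning, not actual
    # whisper.cpp code; decode_window() and tokenize() are placeholders.

    MAX_CONTEXT_TOKENS = 224  # roughly half the decoder context

    def tokenize(text):
        return text.split()

    def decode_window(audio_window, context_tokens):
        # Placeholder for decoding one ~30-second window conditioned on
        # context_tokens; returns whatever tokens the model produces.
        return ["window", str(audio_window), "output"]

    def transcribe(windows, initial_prompt):
        context = tokenize(initial_prompt)  # the prompt is injected only once, here
        output = []
        for w in windows:
            new_tokens = decode_window(w, context[-MAX_CONTEXT_TOKENS:])
            output.extend(new_tokens)
            # Rolling context: previous text, truncated; the original prompt
            # eventually gets pushed out unless its words are re-generated.
            context = (context + new_tokens)[-MAX_CONTEXT_TOKENS:]
        return output

    print(" ".join(transcribe(range(3), "Audiobook reading by a British woman:")))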
It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
We need a better solution. It would be much better if there were an easy way to fine-tune Whisper to learn new vocab.
> It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
I'm not sure why you're so dismissive when real-time transcription is an important use-case that falls under that bucket of "quick snippets".
> It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
I think it's more context-dependent than it is "hard". It's ideal for streaming meeting transcripts. In my use cases, I use the prompt to feed in participant names, company terms and names, and other likely words. It's also much easier to just rattle off a list of difficult or unusually spelled words you know will come up in the transcription.
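For example, a minimal sketch of that pattern with faster-whisper (the names, terms, and file below are made up, apart from the ones mentioned upthread):

    from faster_whisper import WhisperModel

    # Hypothetical meeting context: participant names and in-house terms
    # that the model tends to mangle without a hint.
    participants = ["Junyao", "Priya", "Schuyler"]
    glossary = ["MaiMemo", "diarization", "whisper.cpp"]

    # faster-whisper exposes the same knob as --prompt via initial_prompt.
    prompt = ("Meeting with " + ", ".join(participants)
              + ". Topics: " + ", ".join(glossary) + ".")

    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, info = model.transcribe("meeting.wav", initial_prompt=prompt)

    for seg in segments:
        print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.text}")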
> We need a better solution. It would be much better if there were an easy way to fine-tune Whisper to learn new vocab.
Prompting is infinitely easier than fine-tuning in every respect. I can reuse the same model in any context and just swap out the prompt. I don't have to spend time or money fine-tuning... I don't have to store multiple fine-tuned copies of Whisper for different contexts... I'm not sure what better solution you envision, but fine-tuning is certainly not easier than prompting.
But it will influence the initial text generated, which influences the subsequent text as well. So it theoretically influences the whole thing, just diluted and indirectly.
I wonder if anyone is working on infusing the entire transcript with a "prompt" this way, it seems like a no brainer that would significantly improve accuracy. We already do it with languages, why not with concepts?
whoa, thank you! I did not know about this, and this is super helpful. I've been playing around a lot with whisper.cpp and I have some situations where this might improve things dramatically
Founder of Replicate here. We open pull requests on models[0] to get them running on Replicate so people can try out a demo of the model and run them with an API. They're also packaged with Cog[1] so you can run them as a Docker image.
Somebody happened to stumble across our fork of the model and submitted it. We didn't submit it nor intend for it to be an ad. I hope the submission gets replaced with the upstream repo so the author gets full credit. :)
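For anyone curious what the packaging amounts to, it's roughly a predict.py along the lines of the sketch below (simplified; not the exact file in the fork):

    # Simplified sketch of a Cog predictor for a Whisper-style model; the
    # actual predict.py in the fork may differ. Cog calls setup() once per
    # container start, then predict() for each API request.
    import torch
    from cog import BasePredictor, Input, Path
    from transformers import pipeline

    class Predictor(BasePredictor):
        def setup(self):
            # Load the ASR pipeline once so requests don't pay the startup cost.
            self.pipe = pipeline(
                "automatic-speech-recognition",
                model="openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0",
            )

        def predict(
            self,
            audio: Path = Input(description="Audio file to transcribe"),
            batch_size: int = Input(description="Chunks decoded in parallel", default=24),
        ) -> str:
            out = self.pipe(str(audio), chunk_length_s=30, batch_size=batch_size)
            return out["text"]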
I'm curious: how did you know about this thread? I've seen this happen where a blog or site is mentioned and the author shows up. Is there software to monitor when you're mentioned on HN, or did you just happen to browse it?
I'm sort of confused - is this just a CLI wrapper around faster-whisper, transformers, and distil-whisper? Will this be any faster than running those by themselves? There doesn't seem to be much code here, which is why I'm wondering whether this is actually something to get excited about if I'm already aware of those projects.
Transcription speed and accuracy keep going up, and it's delightful to see the progress. I wish, though, that more effort were dedicated to creating integrated solutions that can accurately transcribe with speaker diarization.
Is diarization only possible with stereo audio in Whisper at the moment? If the voices aren't split left/right, I don't think you can get them separated yet.
So what's in the secret sauce? e.g. faster-whisper "is a reimplementation of OpenAI's Whisper model using CTranslate2" and claims 4x the speed of whisper; what does insanely-fast-whisper do to achieve its gains?
It is missing the most recent commits from what appears to be the real source: https://github.com/Vaibhavs10/insanely-fast-whisper
The only added commit is adding a replicate.com example, whatever that means.
[0] https://github.com/Vaibhavs10/insanely-fast-whisper/pull/42
[1] https://github.com/replicate/cog
Edit: Also, this seems a bit suspicious - it looks like someone just forked another person's active repo (https://github.com/Vaibhavs10/insanely-fast-whisper) and posted it as their own?
Author of Wordcab-Transcribe here. We use faster-whisper + NeMo for diarization, if you want to take a look.
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en...
https://github.com/Vaibhavs10/insanely-fast-whisper/issues/3...
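Once you have both outputs, the glue is mostly bookkeeping; a toy sketch (made-up timings) of assigning each transcript segment to the speaker turn it overlaps most:

    # Toy sketch of the alignment step between a transcriber and a diarizer.
    # Both lists are made up; in practice they would come from faster-whisper
    # segments and NeMo (or pyannote) speaker turns.

    transcript = [  # (start_s, end_s, text) from the ASR model
        (0.0, 4.2, "Thanks everyone for joining."),
        (4.5, 9.0, "Let's start with the roadmap."),
        (9.3, 13.1, "Sure, I can take that one."),
    ]

    speaker_turns = [  # (start_s, end_s, speaker) from the diarizer
        (0.0, 9.1, "SPEAKER_00"),
        (9.1, 14.0, "SPEAKER_01"),
    ]

    def overlap(a_start, a_end, b_start, b_end):
        return max(0.0, min(a_end, b_end) - max(a_start, b_start))

    def assign_speakers(transcript, speaker_turns):
        labelled = []
        for t_start, t_end, text in transcript:
            # Pick the speaker turn with the largest time overlap with this segment.
            speaker = max(
                speaker_turns,
                key=lambda turn: overlap(t_start, t_end, turn[0], turn[1]),
            )[2]
            labelled.append((t_start, t_end, speaker, text))
        return labelled

    for start, end, speaker, text in assign_speakers(transcript, speaker_turns):
        print(f"[{start:5.1f}-{end:5.1f}] {speaker}: {text}")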