One feature of Whisper I think people underuse is the ability to prompt the model to influence the output tokens. You can use it to correct spelling and steer context-dependent words. Some examples from my terminal history:
./main -m models/ggml-small.bin -f alice.wav --prompt "Audiobook reading by a British woman:"
./main -m models/ggml-small.bin -f output.wav --prompt "Research talk by Junyao, Harbin Institute of Technology, Shenzhen, research engineer at MaiMemo"
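The same knob is exposed in the openai-whisper Python package as initial_prompt, as far as I know; a minimal sketch of the first command above:

    import whisper

    # Load the small multilingual model (same size class as ggml-small.bin).
    model = whisper.load_model("small")

    # initial_prompt plays the same role as whisper.cpp's --prompt: it is
    # tokenized and fed to the decoder ahead of the first 30-second window,
    # nudging spelling and style without changing the model itself.
    result = model.transcribe(
        "alice.wav",
        initial_prompt="Audiobook reading by a British woman:",
    )
    print(result["text"])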
It also works across languages. You can use it to influence the transcription to produce traditional or simplified Chinese characters, for instance.
Although I seem to have trouble getting the context to persist across hundreds of tokens. Tokens that are corrected may revert to the model's underlying choices if they aren't repeated enough.
“prompt” is arguably a misnomer. In other implementations, it is correctly called “initial-prompt”, and even the whisper.cpp help describes it as an initial prompt.
It only affects the first 30-second window of the transcription, as far as I've been able to tell. If the word in question appears in that window, it will influence the next window, and so on... but as soon as it doesn't appear in one 30-second window, it's effectively gone, from what I've seen.
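Roughly, my mental model is the sketch below; decode_window() and tokenize() are stand-ins, not actual whisper.cpp code. The prompt is injected once up front, and each later window is conditioned on a rolling buffer of previously generated text, so the prompt falls out of context unless its words keep reappearing.

    # Stand-in sketch of Whisper's long-form conditioning, not actual
    # whisper.cpp code; decode_window() and tokenize() are placeholders.

    MAX_CONTEXT_TOKENS = 224  # roughly half the decoder context

    def tokenize(text):
        return text.split()

    def decode_window(audio_window, context_tokens):
        # Placeholder for decoding one ~30-second window conditioned on
        # context_tokens; returns whatever tokens the model produces.
        return ["window", str(audio_window), "output"]

    def transcribe(windows, initial_prompt):
        context = tokenize(initial_prompt)  # the prompt is injected only once, here
        output = []
        for w in windows:
            new_tokens = decode_window(w, context[-MAX_CONTEXT_TOKENS:])
            output.extend(new_tokens)
            # Rolling context: previous text, truncated; the original prompt
            # eventually gets pushed out unless its words are re-generated.
            context = (context + new_tokens)[-MAX_CONTEXT_TOKENS:]
        return output

    print(" ".join(transcribe(range(3), "Audiobook reading by a British woman:")))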
It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
We need a better solution. It would be much better if there were an easy way to fine-tune Whisper to learn new vocab.
> It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
I'm not sure why you're so dismissive when real-time transcription is an important use-case that falls under that bucket of "quick snippets".
> It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
I think it's more context-dependent than it is "hard". It's ideal for streaming meeting transcripts. In my use cases, I use the prompt to feed in participant names, company terms and names, and other likely words. It's also much easier to just rattle off a list of difficult or unusually spelled words you know will come up in the transcription.
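For example, a minimal sketch of that pattern with faster-whisper (the names, terms, and file below are made up, apart from the ones mentioned upthread):

    from faster_whisper import WhisperModel

    # Hypothetical meeting context: participant names and in-house terms
    # that the model tends to mangle without a hint.
    participants = ["Junyao", "Priya", "Schuyler"]
    glossary = ["MaiMemo", "diarization", "whisper.cpp"]

    # faster-whisper exposes the same knob as --prompt via initial_prompt.
    prompt = ("Meeting with " + ", ".join(participants)
              + ". Topics: " + ", ".join(glossary) + ".")

    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, info = model.transcribe("meeting.wav", initial_prompt=prompt)

    for seg in segments:
        print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.text}")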
> We need a better solution. It would be much better if there were an easy way to fine-tune Whisper to learn new vocab.
Prompting is infinitely easier than fine-tuning in every respect. I can reuse the same model in any context and just swap out the prompt. I don't have to spend time or money fine-tuning... I don't have to store multiple fine-tuned copies of Whisper for different contexts... I'm not sure what better solution you envision, but fine-tuning is certainly not easier than prompting.
But it will influence the initial text generated, which influences the subsequent text as well. So it theoretically influences the whole thing, just diluted and indirectly.
I wonder if anyone is working on infusing the entire transcript with a "prompt" this way, it seems like a no brainer that would significantly improve accuracy. We already do it with languages, why not with concepts?
whoa, thank you! I did not know about this, and this is super helpful. I've been playing around a lot with whisper.cpp and I have some situations where this might improve things dramatically
Founder of Replicate here. We open pull requests on models[0] to get them running on Replicate so people can try out a demo of the model and run them with an API. They're also packaged with Cog[1] so you can run them as a Docker image.
Somebody happened to stumble across our fork of the model and submitted it. We didn't submit it nor intend for it to be an ad. I hope the submission gets replaced with the upstream repo so the author gets full credit. :)
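For anyone curious what the packaging amounts to, it's roughly a predict.py along the lines of the sketch below (simplified; not the exact file in the fork):

    # Simplified sketch of a Cog predictor for a Whisper-style model; the
    # actual predict.py in the fork may differ. Cog calls setup() once per
    # container start, then predict() for each API request.
    import torch
    from cog import BasePredictor, Input, Path
    from transformers import pipeline

    class Predictor(BasePredictor):
        def setup(self):
            # Load the ASR pipeline once so requests don't pay the startup cost.
            self.pipe = pipeline(
                "automatic-speech-recognition",
                model="openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0",
            )

        def predict(
            self,
            audio: Path = Input(description="Audio file to transcribe"),
            batch_size: int = Input(description="Chunks decoded in parallel", default=24),
        ) -> str:
            out = self.pipe(str(audio), chunk_length_s=30, batch_size=batch_size)
            return out["text"]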
I'm curious: how did you know about this thread? I've seen this happen where a blog or site is mentioned and the author shows up. Is there software to monitor when you're mentioned on HN, or did you just happen to browse it?
I'm sort of confused - is this just a CLI wrapper around faster-whisper, transformers, and distil-whisper? Will this be any faster than running those by themselves? There doesn't seem to be much code here, which is why I'm wondering whether this is actually something to get excited about if I'm already aware of those projects.
Transcription speed and accuracy keep going up, and it's delightful to see the progress. I wish, though, that more effort were dedicated to creating integrated solutions that can accurately transcribe with speaker diarization.
Is diarization only possible with stereo audio in Whisper at the moment? If the voices aren't split left/right, I don't think you can get them separated yet.
So what's in the secret sauce? e.g. faster-whisper "is a reimplementation of OpenAI's Whisper model using CTranslate2" and claims 4x the speed of whisper; what does insanely-fast-whisper do to achieve its gains?
It is missing the most recent commits from what appears to be the real source: https://github.com/Vaibhavs10/insanely-fast-whisper
The only added commit is adding a replicate.com example, whatever that means.
[0] https://github.com/Vaibhavs10/insanely-fast-whisper/pull/42
[1] https://github.com/replicate/cog
Edit: Also, this seems a bit suspicious - it looks like someone just forked another person's active repo (https://github.com/Vaibhavs10/insanely-fast-whisper) and posted it as their own?
Author of Wordcab-Transcribe here. We use faster-whisper + NeMo for diarization, if you want to take a look.
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en...
https://github.com/Vaibhavs10/insanely-fast-whisper/issues/3...
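Once you have both outputs, the glue is mostly bookkeeping; a toy sketch (made-up timings) of assigning each transcript segment to the speaker turn it overlaps most:

    # Toy sketch of the alignment step between a transcriber and a diarizer.
    # Both lists are made up; in practice they would come from faster-whisper
    # segments and NeMo (or pyannote) speaker turns.

    transcript = [  # (start_s, end_s, text) from the ASR model
        (0.0, 4.2, "Thanks everyone for joining."),
        (4.5, 9.0, "Let's start with the roadmap."),
        (9.3, 13.1, "Sure, I can take that one."),
    ]

    speaker_turns = [  # (start_s, end_s, speaker) from the diarizer
        (0.0, 9.1, "SPEAKER_00"),
        (9.1, 14.0, "SPEAKER_01"),
    ]

    def overlap(a_start, a_end, b_start, b_end):
        return max(0.0, min(a_end, b_end) - max(a_start, b_start))

    def assign_speakers(transcript, speaker_turns):
        labelled = []
        for t_start, t_end, text in transcript:
            # Pick the speaker turn with the largest time overlap with this segment.
            speaker = max(
                speaker_turns,
                key=lambda turn: overlap(t_start, t_end, turn[0], turn[1]),
            )[2]
            labelled.append((t_start, t_end, speaker, text))
        return labelled

    for start, end, speaker, text in assign_speakers(transcript, speaker_turns):
        print(f"[{start:5.1f}-{end:5.1f}] {speaker}: {text}")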