This is the kind of psycho-bullshit we should stay away from, and it wouldn't happen if we respected each other. Coming from Microsoft, though, it's not surprising.
E.g., if I say "I scream", it sounds phonetically identical to "ice cream". Yet the transcription "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Deferring the decision like this seems necessary to get both low latency and high accuracy. Transcription on Android does exactly this, and you can watch the guesses adjust as you talk; there's a toy sketch of that re-ranking below.
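To make the "adjusting guesses" idea concrete, here's a minimal sketch of keeping competing homophone hypotheses and re-ranking them as more right-hand context arrives. This is not Android's or Whisper's actual decoder; the candidate list and bigram scores are made up for illustration:

```python
# Toy sketch: keep competing hypotheses for an ambiguous span and re-rank
# them as more context arrives. Candidates and scores are invented.

CANDIDATES = ["I scream", "ice cream"]

# Hypothetical bigram scores standing in for a real language model.
BIGRAM_SCORES = {
    ("cream", "is"): 3.0,
    ("scream", "is"): 0.5,
    ("is", "the"): 2.0,
    ("the", "best"): 2.0,
    ("best", "dessert"): 2.5,
}

def lm_score(words):
    """Sum bigram scores; higher means the sequence reads more fluently."""
    return sum(BIGRAM_SCORES.get(pair, 0.1) for pair in zip(words, words[1:]))

def best_guess(right_context):
    """Re-rank the ambiguous span against the words heard after it so far."""
    return max(CANDIDATES,
               key=lambda c: lm_score((c + " " + right_context).lower().split()))

for heard in ["", "is", "is the best dessert"]:
    print(f"after {heard!r}: {best_guess(heard)}")
# With no right-hand context the two readings tie, so the display may show
# either; once "is the best dessert" arrives, "ice cream" wins decisively.
```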
I'm not familiar with Whisper in particular, but typically in an ASR model the decoder, loosely speaking, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this. It also has the benefit of a language model guiding its decoding, so that grammatical productions like "I like ice cream" are favored over "I like I scream".
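Whisper's decoder learns that language-model behavior jointly with the acoustics rather than bolting on a separate LM, but a classic way to see the effect in isolation is "shallow fusion", where hypotheses are ranked by acoustic score plus a weighted LM score. A toy sketch with invented numbers:

```python
# Shallow-fusion sketch: rank hypotheses by acoustic log-prob plus a weighted
# language-model log-prob. All scores below are hypothetical.

hypotheses = {
    # transcript: (acoustic log-prob, LM log-prob)
    "I like I scream":  (-4.1, -9.7),  # acoustically plausible, reads oddly
    "I like ice cream": (-4.2, -2.3),  # acoustically near-identical, fluent
}

LM_WEIGHT = 0.5  # hypothetical fusion weight, normally tuned on held-out data

def fused_score(acoustic_lp, lm_lp, lam=LM_WEIGHT):
    """Combined score = acoustic + lambda * language model."""
    return acoustic_lp + lam * lm_lp

best = max(hypotheses, key=lambda h: fused_score(*hypotheses[h]))
print(best)  # "I like ice cream": the LM term breaks the acoustic near-tie
```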
Inference on a generic LLM may not be subject to these non-determinisms even on a GPU, though; I don't know for sure.
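For what it's worth, the usual source of GPU non-determinism is that floating-point addition isn't associative, so a parallel reduction that sums the same values in a different order on each run (e.g. via atomics) can give different results on identical inputs. A minimal CPU sketch of the order-dependence, with a shuffle standing in for the reordering:

```python
# Floating-point addition is not associative, so summation order matters.
# random.shuffle simulates a GPU reduction visiting values in a new order.

import random

values = [1e16, 1.0, -1e16] * 1000  # large cancellations make order matter

def sum_in_order(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

a = sum_in_order(values)
random.shuffle(values)   # stand-in for a different parallel reduction order
b = sum_in_order(values)
print(a == b, a, b)      # usually False: same values, different float sum
```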
> Yoshi and others -- please keep the feedback coming. We want to hear it, and we genuinely want to improve the product in a way that gives great defaults for the majority of users, while being extremely hackable and customizable for everyone else.
I think an issue with 2550 upvotes, more than four times the count of the second-highest, is very clear feedback about your defaults and/or about making them customizable.