It could, if it chose to, continue to recognize all voices but at the same time limit the things the non-owner could ask for based on owner preferences.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
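For the N-best approach, the core move is sampling several hypotheses from Whisper (e.g. with temperature) and handing them to an LLM to merge. A speculative sketch of the prompt-construction step, with hard-coded stand-in hypotheses since nothing here calls a real Whisper or LLM API:

```python
# Hypothetical sketch of N-best distillation. The hypotheses below are
# stand-ins for temperature-sampled Whisper outputs; the prompt wording
# is illustrative, not from any specific paper or product.

def build_distillation_prompt(hypotheses: list) -> str:
    """Format N-best ASR hypotheses as a prompt asking an LLM to merge them."""
    lines = [f"{i + 1}. {h}" for i, h in enumerate(hypotheses)]
    return (
        "Below are N-best ASR hypotheses for the same utterance. "
        "Output the single most likely transcript.\n" + "\n".join(lines)
    )

hyps = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jump over a lazy dog",
    "a quick brown fox jumps over the lazy dog",
]
prompt = build_distillation_prompt(hyps)
print(prompt)
```

Fine-tuning would then pair such prompts with reference transcripts, so the LLM learns to resolve disagreements between hypotheses rather than invent content.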
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCCs come out of psychoacoustics, the study of how we perceive sound. They emphasize lower frequencies via the mel scale: a Fourier transform converts audio into a frequency spectrum, a bank of mel-spaced filters compresses it, and a log plus discrete cosine transform yields a compact set of coefficients.
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
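The MFCC pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not Whisper's actual front end; the parameter choices (26 filters, 13 coefficients, 512-sample frames) are common defaults I'm assuming, not taken from the comment:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale, so low
    # frequencies get finer resolution than high ones.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_coeffs=13):
    # 1. Frame the signal and apply a Hamming window.
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum via FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank compresses the spectrum perceptually.
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # 4. Log mimics loudness perception; DCT decorrelates the bands.
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]

# Example: one second of a 440 Hz tone at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
print(coeffs.shape)  # (n_frames, 13)
```

Note what's missing: nothing in this pipeline knows what words were said, which is exactly why a prosody-shifting accent can flip the detected language.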
With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.
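One way to picture that dual signal (purely my speculation, not Gladia's actual method; the language scores, weights, and log-linear fusion are all assumptions) is fusing an acoustic language-ID distribution with a content-based one from a first-pass transcript:

```python
# Hypothetical fusion of acoustic (prosody-based) and content-based
# language scores. alpha and the log-linear form are illustrative.

def fuse_language_scores(acoustic: dict, content: dict, alpha: float = 0.4) -> dict:
    """Log-linear interpolation: acoustic^alpha * content^(1-alpha), renormalized."""
    langs = set(acoustic) | set(content)
    fused = {l: acoustic.get(l, 1e-9) ** alpha * content.get(l, 1e-9) ** (1 - alpha)
             for l in langs}
    total = sum(fused.values())
    return {l: v / total for l, v in fused.items()}

# A heavy French accent might fool the acoustic detector, while the
# transcript content clearly reads as English:
acoustic = {"en": 0.3, "fr": 0.7}
content = {"en": 0.9, "fr": 0.1}
scores = fuse_language_scores(acoustic, content)
print(max(scores, key=scores.get))
```

Run per-segment rather than per-file, a scheme like this also handles code-switching, since the winning language can change from one segment to the next.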
In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Speech-to-speech seems like it might produce more natural results than TTS. I've played around with some tools like RVC, but I feel like there are a lot of great AI workflows I'm missing amongst all the AI noise; it's the interesting workflows, and the people doing interesting things with AI, that I'm really after.