It could, if it chose to, continue to recognize all voices but at the same time limit the things the non-owner could ask for based on owner preferences.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
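For the N-best approach, the core move is sampling several hypotheses from Whisper (e.g. with temperature) and handing them to an LLM to merge. A speculative sketch of the prompt-construction step, with hard-coded stand-in hypotheses since nothing here calls a real Whisper or LLM API:

```python
# Hypothetical sketch of N-best distillation. The hypotheses below are
# stand-ins for temperature-sampled Whisper outputs; the prompt wording
# is illustrative, not from any specific paper or product.

def build_distillation_prompt(hypotheses: list) -> str:
    """Format N-best ASR hypotheses as a prompt asking an LLM to merge them."""
    lines = [f"{i + 1}. {h}" for i, h in enumerate(hypotheses)]
    return (
        "Below are N-best ASR hypotheses for the same utterance. "
        "Output the single most likely transcript.\n" + "\n".join(lines)
    )

hyps = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jump over a lazy dog",
    "a quick brown fox jumps over the lazy dog",
]
prompt = build_distillation_prompt(hyps)
print(prompt)
```

Fine-tuning would then pair such prompts with reference transcripts, so the LLM learns to resolve disagreements between hypotheses rather than invent content.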
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCCs come out of psychoacoustics, the study of how we perceive sound. They emphasize lower frequencies via the mel scale: a Fourier transform converts audio into a frequency spectrum, a bank of mel-spaced filters compresses it, and a log plus discrete cosine transform yields a compact set of coefficients.
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
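The MFCC pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not Whisper's actual front end; the parameter choices (26 filters, 13 coefficients, 512-sample frames) are common defaults I'm assuming, not taken from the comment:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale, so low
    # frequencies get finer resolution than high ones.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_coeffs=13):
    # 1. Frame the signal and apply a Hamming window.
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum via FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank compresses the spectrum perceptually.
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # 4. Log mimics loudness perception; DCT decorrelates the bands.
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]

# Example: one second of a 440 Hz tone at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
print(coeffs.shape)  # (n_frames, 13)
```

Note what's missing: nothing in this pipeline knows what words were said, which is exactly why a prosody-shifting accent can flip the detected language.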
With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.
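One way to picture that dual signal (purely my speculation, not Gladia's actual method; the language scores, weights, and log-linear fusion are all assumptions) is fusing an acoustic language-ID distribution with a content-based one from a first-pass transcript:

```python
# Hypothetical fusion of acoustic (prosody-based) and content-based
# language scores. alpha and the log-linear form are illustrative.

def fuse_language_scores(acoustic: dict, content: dict, alpha: float = 0.4) -> dict:
    """Log-linear interpolation: acoustic^alpha * content^(1-alpha), renormalized."""
    langs = set(acoustic) | set(content)
    fused = {l: acoustic.get(l, 1e-9) ** alpha * content.get(l, 1e-9) ** (1 - alpha)
             for l in langs}
    total = sum(fused.values())
    return {l: v / total for l, v in fused.items()}

# A heavy French accent might fool the acoustic detector, while the
# transcript content clearly reads as English:
acoustic = {"en": 0.3, "fr": 0.7}
content = {"en": 0.9, "fr": 0.1}
scores = fuse_language_scores(acoustic, content)
print(max(scores, key=scores.get))
```

Run per-segment rather than per-file, a scheme like this also handles code-switching, since the winning language can change from one segment to the next.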
In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Speech-to-speech seems like it might produce more natural results than TTS. I've played around with some tools like RVC, but I feel like there are a lot of great AI workflows I'm missing amongst all the AI noise; it's the interesting workflows, and the people doing interesting things with AI, that I'm really after.