Seeing this title, I really hoped they'd have a parameter to force the language for voice recognition. As a Finn with a heavy rally-Finnish accent, I find the real-time modes quite often end up transcribing my English speech as Finnish text, and strangely, the Finnish is a correct translation of what I said in English. It might not happen during the first sentence, but after a few queries and replies it usually does.
According to the OpenAI forums this is a common problem. I see they've addressed this in the post by prompting the model to stick to one language, but previously this didn’t work consistently, and in their Playground the newest `User transcript model` is still the same as before (`gpt-4o-transcribe`), so I don’t have high hopes. Must be hard to implement.
edit: Tried it again (with a prompt requesting English like always). By my 6th message it suddenly started transcribing to Finnish, and after that it became more common. Better than it used to be, but in many ways still useless. Though I'm sure it works better for people with lighter accents.
I think it would be neat to hook up a (realtime) speech-to-speech model to something like Home Assistant for smart home controls + generic questions. HA has this feature, but is currently using a STT + text model + TTS pipeline, which works fine, but has higher delays and feels less... natural.
I've been using the voice chat in ChatGPT more and more frequently. I'm curious now to see how the costs associated with this would work through the API on some user-facing features. It's a cool update at a glance.
Lower cost, higher quality, loving it so far. Question: we run this voice proxy, but there the calls are typically started on our end: https://getstream.io/video/voice-agents/
Just happen to have some stats on that (non-US context): 60% pick up a local number, about 40% pick up a foreign number (specifically, the stat I have is for a US number calling someone in a non-US geography).
I have a TestFlight beta for those who want to try it out, hope to have the new model included in the next beta build:
https://x.com/keleftheriou/status/1956932258293755955?s=46
With SIP this logic changes: https://platform.openai.com/docs/guides/realtime-sip
sounds like we can listen to the webhook and start from there?
But for now, yeah, waiting for the webhook is the way to do it.
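For what it's worth, here is a rough sketch of that "listen to the webhook and start from there" flow in Python. It only parses the webhook body and builds the request you would send to accept the call; it does not send anything. The event name `realtime.call.incoming`, the `data.call_id` field, the `/accept` URL shape, and the `gpt-realtime` model name are my assumptions from reading the SIP guide, and `plan_accept` is a made-up helper name, so check the linked docs before relying on any of it. Webhook signature verification is left out entirely.

```python
import json

# Assumed names, not verified against the live API: see the realtime-sip guide.
INCOMING_EVENT = "realtime.call.incoming"
API_BASE = "https://api.openai.com/v1/realtime/calls"


def plan_accept(webhook_body: str, model: str = "gpt-realtime"):
    """Return (url, payload) for accepting an incoming call, or None otherwise.

    Hypothetical helper: it only decides what to send, it performs no I/O.
    """
    event = json.loads(webhook_body)

    # Ignore every webhook that is not an incoming-call notification.
    if event.get("type") != INCOMING_EVENT:
        return None

    # Assumed payload layout: the call id sits under "data".
    call_id = event.get("data", {}).get("call_id")
    if not call_id:
        return None

    url = f"{API_BASE}/{call_id}/accept"
    payload = {
        "type": "realtime",
        "model": model,
        # Prompting for a single language, as discussed upthread.
        "instructions": "Respond only in English.",
    }
    return url, payload
```

You would call `plan_accept(request_body)` inside whatever web framework receives the webhook, then POST `payload` to `url` with your API key if the result is not `None`.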
Outside of OpenAI, lots of mechanisms exist, such as STIR/SHAKEN[0].
[0] https://en.wikipedia.org/wiki/STIR/SHAKEN
More than I expected.